Abstract

We investigate the optimality for model selection of the so-called slope heuristics, V-fold cross-validation
and V-fold penalization in a heteroscedastic regression context with random design. We consider a new class
of linear models that we call strongly localized bases and that generalize histograms, piecewise polynomials
and compactly supported wavelets. We derive sharp oracle inequalities that prove the asymptotic optimality
of the slope heuristics (when the optimal penalty shape is known) and of V-fold penalization. Furthermore,
V-fold cross-validation seems to be suboptimal for a fixed value of V since it asymptotically recovers the
oracle learned from a sample size equal to $1 - V^{-1}$ of the original amount of data. Our results are based on
genuine concentration inequalities for the true and empirical excess risks that are of independent interest.

We show in our experiments the good behavior of the slope heuristics for the selection of linear wavelet 
models. Furthermore, V-fold cross-validation and V-fold penalization have comparable efficiency. 
Introduction

The main goal of this paper is to substantially extend the study, in a heteroscedastic regression with random 
design context, of the optimality of two general model selection devices: the so-called slope heuristics and V- 
fold resampling strategies. More precisely, we consider projection estimators on some general linear models and 
investigate from a theoretical perspective the possibility to derive optimal oracle inequalities for the considered 
model selection procedures. We also experiment and compare the procedures for the selection of linear wavelet 
models. 

The slope heuristics [12] is a recent calibration method of penalization procedures in model selection: from
the knowledge of a (good) penalty shape, it allows one to calibrate a penalty that performs an accurate model
selection. It is based on the existence of a minimal penalty, around which there is a drastic change in the 
behavior of the model selection procedure. Moreover, the optimal penalty is simply linked to the minimal one 
by a factor two. The slope heuristics is thus a general method for the selection of M-estimators [9] and it has 
been successfully applied in various methodological studies surveyed in [10]. 

However, there is a gap between the wide range of applicability of the slope heuristics and its theoretical 
justification. Indeed, there are only a few studies, in quite restrictive frameworks, that theoretically describe the 
optimality of this penalty calibration procedure. First, Birgé and Massart [12] have shown the validity of the
slope heuristics in a generalized linear Gaussian model setting, including the case of homoscedastic regression
with fixed design. Then, Arlot and Massart [9] validated the slope heuristics in a heteroscedastic regression
framework with random design, for the selection of linear models of histograms. This result has been extended
to the case of piecewise polynomial functions in [37]. Lerasle [28, 29] has shown the optimality of the slope
heuristics in least-squares density estimation for the selection of some linear models for both independent and
dependent data. It has also been shown in [35], refining previous partial results of [17], that the slope heuristics
is valid for the selection of histograms in maximum likelihood density estimation. On the negative side, Arlot
and Bach [5] proved that the constant two between the minimal penalty and the optimal one is not always valid 
for the selection of linear estimators in least-squares regression with fixed design. For instance, kernel ridge 
regression leads to a ratio between the optimal penalty and the minimal one that takes values between 1 and 
2. The existence of a minimal penalty—that can be estimated in practice—seems to be general however, even 
for the selection of linear estimators. 

If the noise is homoscedastic, then the shape of the ideal penalty is known and is linear in the dimension of 
the models as in the case of Mallows’ Cp. However, if the noise is heteroscedastic, then Arlot [4] showed that the 
ideal penalty is not in general a function of the linear dimension of the models. Hence, it is likely that finding 
a good penalty shape in order to use the slope heuristics will be hard and another approach would be needed. 
Probably the most commonly used method in practice to select a hyperparameter, such as the linear dimension of the
models in our problem, is the V-fold cross-validation (VFCV) procedure [22], with V classically
taken to be equal to 5 or 10.

Despite its wide success in practice, there are still quite few theoretical results concerning VFCV; they are
surveyed in [6]. Some asymptotic results are described in [23]. Some papers more specifically address the
efficiency of VFCV as a model selection tool by deriving oracle inequalities. But most results, such as in [27]
in a general learning context or in [40] for least-squares regression, do not allow one to tackle the question of the
optimality of the procedure as a model selection tool, since they prove oracle inequalities with unknown or
suboptimal leading constant. A notable exception is [3], which proves that VFCV for a fixed V is indeed
asymptotically suboptimal for the selection of regressograms. This is simply explained by the fact that VFCV 
gives a biased estimation of the risk, as emphasized earlier by Burman [13], who proposed to remove this bias. 

Building on ideas of [13], Arlot [3] defined the so-called V-fold penalization and proved its asymptotic
optimality, even for fixed V, for the selection of histograms. In particular, the procedure adapts to the
heteroscedasticity of the noise, a property of V-fold techniques also emphasized in [7] in the context of
change-point detection. The idea is that V-fold penalization gives an unbiased estimate of the risk by adding
to the empirical risk a cross-validated estimate of the ideal penalty. However, in practice V-fold penalization
and cross-validation roughly give the same accuracy, since the over-penalization performed by cross-validation
can actually be an advantage when the sample size is small to moderate. Concerning the choice of V in either
VFCV or penalization, Arlot and Lerasle [8] recently justified in a least-squares density estimation context that
V = 5 or 10 is a reasonable choice.

The theoretical investigation of the optimality of either the slope heuristics or V-fold strategies will be based,
among other things, on sharp results that describe the concentration of the true and the empirical excess risks
when the model is fixed, but with dimension allowed to depend on the sample size. Since the excess risk of an
empirical risk minimizer is a central object of the theory of statistical learning, such concentration results and
subsequent optimal upper and lower bounds for the excess risk of least-squares estimators are of independent
interest. Moreover, the concentration of the excess risk around a single deterministic point is an exciting new direction
of research that refines more classical excess risk bounds. It recently gained interest after the work of Chatterjee 
[18], proving concentration inequalities for excess risk in least-squares regression under convex constraint and 
deducing universal admissibility of least-squares estimation in this context. 

It is worth noting that one of the main arguments developed in [18] and leading to the concentration of the excess risk
is a formula expressing the excess risk as the maximizer of a functional related to local suprema of a Gaussian
process. In fact, such a representation of the excess risk of a general M-estimator in terms of an empirical process
appeared earlier in Saumard [36] (see Remark 1 of Section 3 therein) and was also used to prove concentration
inequalities for the excess risk of a projection estimator in least-squares regression. Building on [18], Muro
and van de Geer [33] recently proved concentration inequalities for the excess risk in regularized least-squares
regression and van de Geer and Wainwright [39] proposed a generic framework of regularized M-estimation
from which the concentration of the excess risk can be derived. Both studies are also based on a representation of the excess risk
in terms of either a Gaussian or an empirical process.

Let us now detail our contributions: 

• We propose a new analytical property, which we call strongly localized basis and which covers many
functional bases. We show that it is a refinement of the classical concept of localized basis [11] that
encompasses the cases of histograms, piecewise polynomials and compactly supported wavelets. We prove
better results for strongly localized bases than for localized bases, while all known examples of localized 
bases are in fact strongly localized. Therefore, the concept of strongly localized basis is a way to describe 
some functional bases that is of independent interest and that could be used in many other nonparametric 
settings. 

• We substantially extend the theoretical analysis of the slope heuristics, generalizing the results of [9, 37] 
to the case of strongly localized bases. 

• We prove sharp oracle inequalities for V-fold cross-validation with fixed V, showing that it
asymptotically recovers an oracle model learned with a fraction equal to $1 - V^{-1}$ of the original amount of
data. Then we improve on these bounds by considering V-fold penalization, which satisfies optimal
oracle inequalities. By proving such a result, we generalize a previous study of Arlot [3], from the case of
histograms to the case of strongly localized bases.

• We prove concentration bounds for the excess risk of projection estimators, which are of independent
interest. These results are based on previous work [36] and on a new approach to sup-norm consistency.
Indeed, we generalize previous representation formulas in terms of empirical processes for the excess risk of
a (regularized) M-estimator obtained in [36, 39] to any functional of an M-estimator and use them to obtain
bounds in sup-norm for projection estimators on strongly localized bases. These new representation
formulas are also of independent interest, since they are totally general in M-estimation.

• We show in our experiments the good behavior of the slope heuristics for the selection of linear wavelet
models. Indeed, it often compares favorably to VFCV and V-fold penalization. In addition, Mallows' Cp seems
to be efficient as well. We also recover in our more general framework some previous observations of Arlot
[3]: even if V-fold penalization has better theoretical guarantees than V-fold cross-validation, it
has only comparable efficiency in practice.

The paper is organized as follows. In Section 1, we describe the statistical framework. The concept of 
strongly localized basis is presented in Section 2. The slope heuristics is validated in Section 3, and V-fold
strategies are considered in Section 4. Then we present our results for a fixed model, which are of independent
interest, in Section 5. Numerical experiments are detailed in Section 6. The proofs are postponed to Section 7.





F. Navarro and A. Saumard 


1 Statistical framework 


We consider $n$ independent observations $\xi_i = (X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ with common distribution $P$, as well as a generic
random variable $\xi = (X, Y)$, independent of the sample $(\xi_1, \ldots, \xi_n)$ and following the same distribution $P$. The
feature space $\mathcal{X}$ is a subset of $\mathbb{R}^d$, $d \geq 1$. The marginal distribution of $X_i$ is denoted $P^X$. We assume that the
following relation holds,

\[ Y = s_*(X) + \sigma(X)\,\varepsilon , \]

where $s_* \in L_2(P^X)$ is the regression function of $Y$ with respect to $X$, to be estimated. Conditionally on $X$,
the residual $\varepsilon$ is normalized, i.e. it has mean zero and variance one. The function $\sigma : \mathcal{X} \to \mathbb{R}_+$ is the unknown
heteroscedastic noise level.

To estimate $s_*$, we consider a finite collection of models $\mathcal{M}_n$, with cardinality depending on the sample size
$n$. Each model $m \in \mathcal{M}_n$ will be a finite-dimensional vector space of linear dimension $D_m$. The models that we
consider in this paper are more precisely defined in Section 2 below.

We write $\|s\|_2 = \left(\int_{\mathcal{X}} s^2 \, dP^X\right)^{1/2}$ the quadratic norm in $L_2(P^X)$ and $s_m$ the orthogonal projection of $s_*$ onto
$m$ in the Hilbert space $(L_2(P^X), \|\cdot\|_2)$. For a function $f \in L_1(P)$, we write $P(f) = Pf = \mathbb{E}[f(\xi)]$. By setting
$\gamma : L_2(P^X) \to L_1(P)$ the least-squares contrast, defined by

\[ \gamma(s) : (x, y) \mapsto (y - s(x))^2 , \quad s \in L_2(P^X) , \]

the regression function $s_*$ is characterized by the following relation,

\[ s_* = \arg\min_{s \in L_2(P^X)} P(\gamma(s)) . \]

The projections $s_m$ also satisfy,

\[ s_m = \arg\min_{s \in m} P(\gamma(s)) . \]

For each model $m \in \mathcal{M}_n$, we consider a least-squares estimator $\hat{s}_m$ (possibly non unique), satisfying

\[ \hat{s}_m \in \arg\min_{s \in m} \{ P_n(\gamma(s)) \} = \arg\min_{s \in m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - s(X_i))^2 \right\} , \]

where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{\xi_i}$ is the empirical measure built from the data.

The performance of the least-squares estimators is tackled through their excess loss,

\[ \ell(s_*, \hat{s}_m) := P(\gamma(\hat{s}_m) - \gamma(s_*)) = \|\hat{s}_m - s_*\|_2^2 . \]

We split the excess risk into a sum of two terms,

\[ \ell(s_*, \hat{s}_m) = \ell(s_*, s_m) + \ell(s_m, \hat{s}_m) , \]

where

\[ \ell(s_*, s_m) := P(\gamma(s_m) - \gamma(s_*)) = \|s_m - s_*\|_2^2 \quad \text{and} \quad \ell(s_m, \hat{s}_m) := P(\gamma(\hat{s}_m) - \gamma(s_m)) \geq 0 . \]

The quantity $\ell(s_*, s_m)$ is a deterministic term called the bias of the model $m$, while $\ell(s_m, \hat{s}_m)$ is a random
variable that we call the excess risk of the least-squares estimator $\hat{s}_m$ on the model $m$. Notice that by the
Pythagorean theorem, it holds

\[ \ell(s_m, \hat{s}_m) = \|\hat{s}_m - s_m\|_2^2 . \]
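To make the decomposition above concrete, here is a minimal Python sketch under stated assumptions: a uniform design on $[0,1]$, a regular histogram model with `n_bins` cells, and a hypothetical regression function and noise level chosen only for illustration. The helper names (`fit_histogram`, `decomposition`) are not from the paper. The projection $s_m$ and the $L_2$ norms are approximated on a midpoint grid, so the Pythagorean identity holds exactly on that grid.

```python
import math
import random


def fit_histogram(xs, ys, n_bins):
    """Least-squares estimator on the regular histogram model: bin-wise mean of Y."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)  # index of the cell containing x
        sums[b] += y
        counts[b] += 1
    # Convention (assumption): predict 0 on cells that contain no data.
    return [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]


def grid(n_bins, per_bin=200):
    """Midpoint grid with the same number of points in every cell."""
    return [[(b + (g + 0.5) / per_bin) / n_bins for g in range(per_bin)]
            for b in range(n_bins)]


def decomposition(xs, ys, s_star, n_bins):
    """Return (excess loss, bias, excess risk on the model) for a uniform design."""
    s_hat = fit_histogram(xs, ys, n_bins)
    cells = grid(n_bins)
    # Projection s_m: cell-wise average of s_star (grid approximation).
    s_m = [sum(s_star(x) for x in cell) / len(cell) for cell in cells]
    n_pts = sum(len(cell) for cell in cells)
    total = sum((s_hat[b] - s_star(x)) ** 2
                for b, cell in enumerate(cells) for x in cell) / n_pts
    bias = sum((s_m[b] - s_star(x)) ** 2
               for b, cell in enumerate(cells) for x in cell) / n_pts
    excess = sum((s_hat[b] - s_m[b]) ** 2 for b in range(n_bins)) / n_bins
    return total, bias, excess


rng = random.Random(0)
s_star = lambda x: math.sin(2 * math.pi * x)   # hypothetical regression function
sigma = lambda x: 0.2 + x                      # hypothetical heteroscedastic noise level
xs = [rng.random() for _ in range(500)]
ys = [s_star(x) + sigma(x) * rng.gauss(0.0, 1.0) for x in xs]
total, bias, excess = decomposition(xs, ys, s_star, n_bins=8)
```

In this sketch `total` plays the role of $\ell(s_*, \hat{s}_m)$, `bias` of $\ell(s_*, s_m)$ and `excess` of $\ell(s_m, \hat{s}_m)$; the first equals the sum of the other two up to floating-point error.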

Having at hand the collection of models $\mathcal{M}_n$, we want to construct an estimator whose excess risk is as close
as possible to the excess risk of an oracle model $m_*$,

\[ m_* \in \arg\min_{m \in \mathcal{M}_n} \{ \ell(s_*, \hat{s}_m) \} . \tag{1} \]

We propose to perform this task via a penalization procedure: given some penalty pen, that is a function from
$\mathcal{M}_n$ to $\mathbb{R}_+$, we consider the following selected model,

\[ \hat{m} \in \arg\min_{m \in \mathcal{M}_n} \{ P_n(\gamma(\hat{s}_m)) + \mathrm{pen}(m) \} . \tag{2} \]

The goal is then to find a good penalty, such that the selected model $\hat{m}$ satisfies an oracle inequality of the form

\[ \ell(s_*, \hat{s}_{\hat{m}}) \leq C \times \inf_{m \in \mathcal{M}_n} \ell(s_*, \hat{s}_m) , \tag{3} \]

with probability close to one and with some constant $C \geq 1$, as close to one as possible.
2 Strongly localized bases

We define here the analytic constraints that we need to put on the models in order to derive our model selection 
results. We also provide various examples of such models. 


2.1 Definition 

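Let us take a finite-dimensional model $m$ with linear dimension $D_m$ and an orthonormal basis $(\varphi_k)_{k=1}^{D_m}$. The
family $(\varphi_k)_{k=1}^{D_m}$ is called a strongly localized basis (with respect to the probability measure $P^X$) if the following
assumption is satisfied:

(Aslb) there exist $r_m > 0$, $b_m \in \mathbb{N}^*$, a partition $(\Pi_i)_{i=1}^{b_m}$ of $\{1, \ldots, D_m\}$, positive constants $(A_i)_{i=1}^{b_m}$ and an
orthonormal basis $(\varphi_k)_{k=1}^{D_m}$ of $(m, \|\cdot\|_2)$ such that $0 < A_1 \leq A_2 \leq \ldots \leq A_{b_m} < +\infty$,

\[ \sum_{i=1}^{b_m} \sqrt{A_i} \leq r_m \sqrt{D_m} \tag{4} \]

and

\[ \text{for all } i \in \{1, \ldots, b_m\}, \text{ for all } k \in \Pi_i, \quad \|\varphi_k\|_\infty \leq r_m \sqrt{A_i} . \tag{5} \]

Moreover, for every $(i, j) \in \{1, \ldots, b_m\}^2$ and $k \in \Pi_i$, we set

\[ \Pi_{j|k} = \{ l \in \Pi_j \; ; \; \mathrm{supp}(\varphi_k) \cap \mathrm{supp}(\varphi_l) \neq \emptyset \} \]

and we assume that there exists a positive constant $A_c$ such that for all $j \in \{1, \ldots, b_m\}$,

\[ \max_{k \in \Pi_i} \mathrm{Card}(\Pi_{j|k}) \leq A_c \left( A_j A_i^{-1} \vee 1 \right) . \tag{6} \]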


To our knowledge, the concept of strongly localized basis is new. In (5), we ask for a control in sup-norm
of each element of the considered basis. We also require in (6) a control of the number of intersections between
the supports of the elements of the considered orthonormal basis.

As shown in Section 2.2 below, the property of strongly localized basis allows us to unify the treatment of
some models of histograms, piecewise polynomials and compactly supported wavelets. From this point of view,
we may interpret the parameter $b_m$ as the number of "scales" in the basis, which in particular equals one for
histograms and piecewise polynomials. It is also equal to the number of resolutions in the multi-resolution
analysis associated to wavelet models. See Section 2.2 below for details about these examples.

The classical concept of localized basis (Birgé and Massart [11]) also covers the previous examples. More
precisely, recall that an orthonormal basis $(\varphi_k)_{k=1}^{D_m}$ of $(m, \|\cdot\|_2)$ is a localized basis if there exists $r_\varphi > 0$ such
that, for all $\beta = (\beta_k)_{k=1}^{D_m} \in \mathbb{R}^{D_m}$,

\[ \left\| \sum_{k=1}^{D_m} \beta_k \varphi_k \right\|_\infty \leq r_\varphi \sqrt{D_m} \max_{k \in \{1, \ldots, D_m\}} |\beta_k| . \]

In fact, we show in the next proposition that strongly localized bases are localized in the classical sense.
The interest of strongly localized bases over localized bases then comes from the fact that they allow us to derive
concentration bounds for the excess risks for models with dimension much larger than what we can prove with
localized bases (up to $D_m \ll n$ for strongly localized bases, against a more restrictive constraint on $D_m$ for
localized ones). This point is detailed in Section 5.1.


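Proposition 2.1 If an orthonormal basis $(\varphi_k)_{k=1}^{D_m}$ is strongly localized, then it is localized. More precisely, if
$(\varphi_k)_{k=1}^{D_m}$ satisfies (Aslb), then for every $\beta = (\beta_k)_{k=1}^{D_m} \in \mathbb{R}^{D_m}$,

\[ \left\| \sum_{k=1}^{D_m} \beta_k \varphi_k \right\|_\infty \leq A_c r_m \sum_{i=1}^{b_m} \sqrt{A_i} \max_{k \in \Pi_i} |\beta_k| \leq A_c r_m^2 \sqrt{D_m} \max_{k \in \{1, \ldots, D_m\}} |\beta_k| . \]

Conversely, if $(\varphi_k)_{k=1}^{D_m}$ is a localized basis in the sense recalled above, with constant $r_\varphi$, then it achieves (4) and (5)
above with $b_m = 1$, $A_1 = D_m$ and $r_m = \max\{r_\varphi, 1\}$.

Proposition 2.1 shows that the parameter $r_m$ appearing in the definition of a strongly localized basis is
closely related to the parameter $r_\varphi$ defining a localized basis.

The proof of Proposition 2.1 can be found in Section 7.1.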
2.2 Examples

We investigate here the scope of the concept of strongly localized basis by providing some examples of linear 
models achieving this condition. 

2.2.1 Histograms and piecewise polynomials 

It is proved in [36] that linear models of histograms and more general piecewise polynomials with bounded
degree are localized bases in $L_2(P^X)$ if the underlying partition $\mathcal{P}$ of $\mathcal{X}$ is lower-regular in the sense that there
exists a constant $c_m$ such that

\[ 0 < c_m \leq \sqrt{ \mathrm{Card}(\mathcal{P}) \inf_{I \in \mathcal{P}} P^X(I) } . \]

More precisely, if $r \in \mathbb{N}$ is the maximal degree of the piecewise polynomials ($r = 0$ in the case of histograms),
then any orthonormal basis $\{\varphi_{I,j},\ I \in \mathcal{P},\ j \in \{0, \ldots, r\}\}$ of $(m, \|\cdot\|_2)$ such that for all $j \in \{0, \ldots, r\}$, $\varphi_{I,j}$ is
supported by the element $I$ of $\mathcal{P}$, is localized. Hence, by Proposition 2.1, it achieves Inequalities (4) and (5) of
the definition of strongly localized basis with $b_m = 1$ and $A_1 = D_m = (r + 1)\,\mathrm{Card}(\mathcal{P})$. It is also immediately seen
that such a basis achieves in this case (6) with $A_c = r + 1$. Furthermore, the value of $r_m$ is explicit for histograms
(Lemma 4, [36]) and it has a more complicated expression for piecewise polynomials (Lemma 7, [36]). As a result, models
of histograms and piecewise polynomials with bounded degree and underlying lower-regular partition $\mathcal{P}$ of $\mathcal{X}$
are endowed with a strongly localized structure.

2.2.2 Compactly supported wavelet expansions 

We assume here that $\mathcal{X} = [0,1]$ and take $b_m \in \mathbb{N}^*$. For details about wavelets and interactions with Statistics,
we refer to [25]. Set $\phi_0$ the father wavelet and $\psi_0$ the mother wavelet. For every integers $j \geq 0$, $1 \leq k \leq 2^j$,
define

\[ \psi_{j,k} : x \mapsto 2^{j/2} \psi_0 \left( 2^j x - k + 1 \right) . \]

As explained in [19], there are many ways to consider wavelets on the interval. We will consider here one of the
most classical solutions, which consists of using "periodized" wavelets. To this aim, we associate to a function $\psi$
on $\mathbb{R}$ the 1-periodic function

\[ \psi^{per}(x) = \sum_{p \in \mathbb{Z}} \psi(x + p) . \]

Notice that if $\psi$ has a compact support, then the sum at the right-hand side of the latter equality is finite for
any $x$.

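We set for every integers $i, j, l \geq 0$ satisfying $i \leq j$ and $1 \leq l \leq 2^i$,

\[ \Lambda(j) = \{ (j, k) \; ; \; 1 \leq k \leq 2^j \} , \]

\[ \Lambda(j, i, l) = \{ (j, k) \; ; \; 2^{j-i}(l - 1) + 1 \leq k \leq 2^{j-i} l \} . \]

Moreover, we set $\psi_{-1,k}(x) = \phi_0(x - k + 1)$,

\[ \Lambda(-1) = \{ (-1, k) \; ; \; \mathrm{supp}(\psi_{-1,k}) \cap [0,1] \neq \emptyset \} \quad \text{and} \quad \Lambda_{b_m} = \bigcup_{j=-1}^{b_m} \Lambda(j) . \]

Notice that for every integers $i, j \geq 0$ such that $i \leq j$, $\{ \Lambda(j, i, l) \; ; \; 1 \leq l \leq 2^i \}$ is a partition of $\Lambda(j)$, which
means that

\[ \Lambda(j) = \bigcup_{l=1}^{2^i} \Lambda(j, i, l) \quad \text{and, for all } 1 \leq l, h \leq 2^i \text{ with } l \neq h, \quad \Lambda(j, i, l) \cap \Lambda(j, i, h) = \emptyset . \]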

We consider the model

\[ m = \mathrm{Span} \{ \psi^{per}_\lambda \; ; \; \lambda \in \Lambda_{b_m} \} . \tag{7} \]

Notice that the linear dimension $D_m$ of $m$ satisfies $D_m = 2^{b_m + 1}$.
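The index sets above can be checked mechanically. The following Python sketch enumerates $\Lambda(j)$ and $\Lambda(j,i,l)$, verifies the partition property, and counts the dimension of the model. It assumes that, after periodization, the level $-1$ contributes a single function; the function names are hypothetical and only serve the illustration.

```python
def level_indices(j):
    """Lambda(j) = {(j, k) ; 1 <= k <= 2^j}."""
    return [(j, k) for k in range(1, 2 ** j + 1)]


def block_indices(j, i, l):
    """Lambda(j, i, l) = {(j, k) ; 2^(j-i) (l-1) + 1 <= k <= 2^(j-i) l}."""
    width = 2 ** (j - i)
    return [(j, k) for k in range(width * (l - 1) + 1, width * l + 1)]


def is_partition(j, i):
    """Check that the blocks Lambda(j, i, l), 1 <= l <= 2^i, partition Lambda(j)."""
    blocks = [block_indices(j, i, l) for l in range(1, 2 ** i + 1)]
    flat = [index for block in blocks for index in block]
    no_overlap = len(flat) == len(set(flat))
    covers = sorted(flat) == level_indices(j)
    return no_overlap and covers


def model_dimension(b_m):
    """Dimension of the wavelet model, counting one function at level -1 (assumption)."""
    return 1 + sum(len(level_indices(j)) for j in range(b_m + 1))
```

For instance, `is_partition(5, 2)` checks that the four blocks of size eight cover the 32 indices of level 5 without overlap, and `model_dimension(3)` returns 16, in agreement with $D_m = 2^{b_m+1}$.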

Proposition 2.2 With the notations above, if $\phi_0$ and $\psi_0$ are compactly supported, then $\{ \psi^{per}_\lambda \; ; \; \lambda \in \Lambda_{b_m} \}$ is a
strongly localized basis on $([0,1], \mathrm{Leb})$, with parameters $b_m$ as defined above, $A_j = 2^j$ for $j \geq 0$ and $A_{-1} = 1$
(an explicit value of $r_m$ is also given in the proof, but is more complicated).
The proof of Proposition 2.2 can be found in Section 7.1. Proposition 2.2 proves that periodized compactly
supported wavelets on the unit interval form a strongly localized basis for the Lebesgue measure.

Considering the Haar basis, we can avoid the use of periodization and consider more general measures than 
the Lebesgue one. 


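Proposition 2.3 Let us take $\phi_0 = \mathbf{1}_{[0,1]}$ and $\psi_0 = \mathbf{1}_{[0,1/2]} - \mathbf{1}_{(1/2,1]}$ and consider the model $m$ given in (7). Set
for every integers $j \geq 0$, $1 \leq k \leq 2^j$,

\[ p_{j,k,-} = P^X \left( \left[ 2^{-j}(k-1), 2^{-j}\left(k - \tfrac{1}{2}\right) \right] \right) , \quad p_{j,k,+} = P^X \left( \left( 2^{-j}\left(k - \tfrac{1}{2}\right), 2^{-j} k \right] \right) \]

and

\[ \psi_{j,k} : x \in [0,1] \mapsto \frac{ p_{j,k,+} \mathbf{1}_{[2^{-j}(k-1),\, 2^{-j}(k-1/2)]}(x) - p_{j,k,-} \mathbf{1}_{(2^{-j}(k-1/2),\, 2^{-j}k]}(x) }{ \sqrt{ p_{j,k,+}^2 \, p_{j,k,-} + p_{j,k,-}^2 \, p_{j,k,+} } } . \]

Moreover we set $\psi_{-1} = \phi_0$. Assume that $P^X$ has a density $f$ with respect to Leb on $[0,1]$ and that there exists
$c_{\min} > 0$ such that for all $x \in [0,1]$,

\[ f(x) \geq c_{\min} > 0 . \]

Then $\{ \psi_\lambda \; ; \; \lambda \in \Lambda_{b_m} \}$ is a strongly localized orthonormal basis of $(m, \|\cdot\|_2)$. Indeed, by setting $A_{-1} = 1$ and
$A_j = 2^j$, $j \geq 0$, we have for every integers $j \geq 0$, $1 \leq k \leq 2^j$,

\[ \|\psi_{j,k}\|_\infty \leq \sqrt{ \frac{2}{c_{\min}} } \sqrt{A_j} \quad \text{and} \quad \sum_{j=-1}^{b_m} \sqrt{A_j} \leq \left( \sqrt{2} + 1 \right) \sqrt{D_m} . \]

Finally, if $\Lambda_{j|\mu} = \{ \lambda \in \Lambda(j) \; ; \; \mathrm{supp}(\psi_\mu) \cap \mathrm{supp}(\psi_\lambda) \neq \emptyset \}$ for $\mu \in \Lambda_{b_m}$ and $j \in \{-1, 0, 1, \ldots, b_m\}$, then
for all $i \in \{-1, 0, 1, \ldots, b_m\}$,

\[ \max_{\mu \in \Lambda(i)} \mathrm{Card}(\Lambda_{j|\mu}) \leq A_j A_i^{-1} \vee 1 . \]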


Proposition 2.3, whose proof is straightforward and left to the reader, ensures that if $P^X$ has a density which
is uniformly bounded away from zero on $\mathcal{X}$, then the Haar basis is a strongly localized orthonormal basis for
the $L_2(P^X)$-norm. More precisely, with the notations of (Aslb), $r_m = \max\left\{ \sqrt{2} + 1, \sqrt{2 c_{\min}^{-1}} \right\}$ and $A_c = 1$ are
convenient.
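As a numerical sanity check of the Haar construction, the sketch below takes a hypothetical design density $f(x) = 0.5 + x$ on $[0,1]$ (so that $c_{\min} = 0.5$), computes the two values taken by each reweighted Haar function, and verifies that it is centered and normalized in $L_2(P^X)$ and that its sup-norm is at most $\sqrt{2/c_{\min}}\, 2^{j/2}$. The density and the function names are assumptions made for the illustration only.

```python
import math

C_MIN = 0.5  # lower bound of the hypothetical density f(x) = 0.5 + x on [0, 1]


def cdf(x):
    """Distribution function of the hypothetical density f(x) = 0.5 + x."""
    return 0.5 * x + 0.5 * x * x


def haar_values(j, k):
    """Return (a, b, p_minus, p_plus): psi_{j,k} equals a on the left half, -b on the right half."""
    left, mid, right = (k - 1) / 2 ** j, (k - 0.5) / 2 ** j, k / 2 ** j
    p_minus = cdf(mid) - cdf(left)   # mass of the left half of the dyadic interval
    p_plus = cdf(right) - cdf(mid)   # mass of the right half of the dyadic interval
    norm = math.sqrt(p_plus ** 2 * p_minus + p_minus ** 2 * p_plus)
    return p_plus / norm, p_minus / norm, p_minus, p_plus


def check_level(j, tol=1e-9):
    """Check centering, unit norm and the sup-norm bound for every k at level j."""
    bound = math.sqrt(2.0 / C_MIN) * math.sqrt(2 ** j)
    for k in range(1, 2 ** j + 1):
        a, b, p_minus, p_plus = haar_values(j, k)
        mean = a * p_minus - b * p_plus              # integral of psi against P^X
        sq_norm = a ** 2 * p_minus + b ** 2 * p_plus  # integral of psi^2 against P^X
        if abs(mean) > tol or abs(sq_norm - 1.0) > tol or max(a, b) > bound + tol:
            return False
    return True
```

Calling `check_level(j)` for the first few levels returns `True` for this density, which is consistent with the constants stated after Proposition 2.3.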


3 The slope heuristics 

3.1 Principles 

The slope heuristics is a conjunction of general facts about penalization techniques in model selection, that lead 
in practice to an efficient penalty calibration procedure. Let us briefly recall the main ideas underlying the slope 
heuristics. 

Consider the model selection problem described in (2). First, there exists a minimal penalty, denoted $\mathrm{pen}_{\min}$,
such that if $\mathrm{pen}(m_1) < \mathrm{pen}_{\min}(m_1)$ where $m_1$ is one of the largest models in $\mathcal{M}_n$, then the procedure defined
in (2) totally misbehaves in the sense that the dimension of the selected model is one of the largest of the
collection, $D_{\hat{m}} \gtrsim D_{m_1}$, and the excess risk of the selected model explodes compared to the excess risk of the
oracle.

Furthermore, if $\mathrm{pen} > \mathrm{pen}_{\min}$ uniformly over the collection of models, then the selected model is of reasonable
dimension and achieves an oracle inequality as in (3).

Arlot and Massart [9] conjectured the validity in a large M-estimation context of the following candidate for
the minimal penalty,

\[ \mathrm{pen}_{\min}(m) = \mathbb{E} \left[ \ell_{emp}(\hat{s}_m, s_m) \right] , \tag{8} \]

where $\ell_{emp}(\hat{s}_m, s_m)$ is the empirical excess risk on the model $m \in \mathcal{M}_n$, defined to be

\[ \ell_{emp}(\hat{s}_m, s_m) = P_n(\gamma(s_m) - \gamma(\hat{s}_m)) \geq 0 . \tag{9} \]
Figure 1: If $\mathrm{pen}_{\min} = \alpha_{\min} \cdot \mathrm{pen}_{shape}$, for a known penalty shape $\mathrm{pen}_{shape}$, then one can estimate $\alpha_{\min}$ by using
the dimension jump. This gives $\hat{\alpha}_{\min}$, and $\mathrm{pen}_{opt} = 2 \hat{\alpha}_{\min} \cdot \mathrm{pen}_{shape}$ is then an optimal penalty according to the
slope heuristics.


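Finally, if the penalty satisfies $\mathrm{pen} = 2 \times \mathrm{pen}_{\min}$ then it is optimal in the sense that the excess risk of the
selected model converges to the excess risk of the oracle when the amount of data tends to infinity,

\[ \frac{ \ell(s_*, \hat{s}_{\hat{m}}) }{ \inf_{m \in \mathcal{M}_n} \ell(s_*, \hat{s}_m) } \xrightarrow[n \to +\infty]{} 1 . \]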

From the previous facts, two algorithms have been built in order to optimally calibrate a penalty shape. Both 
are based on the estimation of the minimal penalty. One takes advantage of the dimension jump of the selected 
model occurring around the minimal penalty (see Figure 1) and the other is based on formula (8), performing 
a robust regression of the empirical risk with respect to the penalty shape. We refer to the survey paper [10] 
for further details about the algorithmic and theoretical works existing on the slope heuristics. 
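The dimension-jump calibration can be sketched in a few lines of Python. The example below is an idealized, noise-free toy setting chosen only for illustration (bias $D^{-2}$, minimal penalty exactly $D/n$, penalty shape $D/n$); the convention of taking the first grid value after the largest drop as the estimate of $\alpha_{\min}$ is one possible choice among several, and the function names are hypothetical.

```python
def select_dimension(emp_risk, shape, alpha):
    """Dimension minimizing the penalized criterion emp_risk + alpha * shape."""
    return min(emp_risk, key=lambda d: emp_risk[d] + alpha * shape[d])


def dimension_jump(emp_risk, shape, alphas):
    """Return (estimated minimal constant, dimensions selected along the grid of alphas)."""
    selected = [select_dimension(emp_risk, shape, a) for a in alphas]
    # Drop in selected dimension between two consecutive values of alpha.
    drops = [selected[t - 1] - selected[t] for t in range(1, len(selected))]
    t_jump = 1 + max(range(len(drops)), key=lambda t: drops[t])
    return alphas[t_jump], selected


# Toy setting (assumption): empirical risk = bias - minimal penalty, without noise.
n = 100
dims = range(1, n + 1)
shape = {d: d / n for d in dims}                  # known penalty shape
emp_risk = {d: d ** -2 - shape[d] for d in dims}  # here alpha_min = 1
alphas = [0.05 * t for t in range(61)]            # grid on [0, 3]

alpha_min_hat, selected = dimension_jump(emp_risk, shape, alphas)
# Slope heuristics: the final penalty is twice the estimated minimal penalty.
d_final = select_dimension(emp_risk, shape, 2 * alpha_min_hat)
```

In this toy setting the selected dimension stays at the largest value for constants below one and collapses just above it, so the jump locates the minimal constant up to the grid step; doubling it then selects a dimension of moderate size.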

3.2 Assumptions and comments 

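Set of assumptions: (SA)

(P1) Polynomial complexity of $\mathcal{M}_n$: there exist some constants $c_{\mathcal{M}}, \alpha_{\mathcal{M}} > 0$ such that $\mathrm{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.

(Auslb) Existence of strongly localized bases: there exist $r_{\mathcal{M}}, A_c > 0$ such that for every $m \in \mathcal{M}_n$, there exist
$b_m \in \mathbb{N}^*$, a partition $(\Pi_i)_{i=1}^{b_m}$ of $\{1, \ldots, D_m\}$, positive constants $(A_i)_{i=1}^{b_m}$ and an orthonormal basis
$(\varphi_k)_{k=1}^{D_m}$ of $(m, \|\cdot\|_2)$ such that $0 < A_1 \leq A_2 \leq \ldots \leq A_{b_m} < +\infty$,

\[ \sum_{i=1}^{b_m} \sqrt{A_i} \leq r_{\mathcal{M}} \sqrt{D_m} , \]

and

\[ \text{for all } i \in \{1, \ldots, b_m\}, \text{ for all } k \in \Pi_i, \quad \|\varphi_k\|_\infty \leq r_{\mathcal{M}} \sqrt{A_i} . \]

Moreover, for every $(i, j) \in \{1, \ldots, b_m\}^2$ and $k \in \Pi_i$, we set

\[ \Pi_{j|k} = \{ l \in \Pi_j \; ; \; \mathrm{supp}(\varphi_k) \cap \mathrm{supp}(\varphi_l) \neq \emptyset \} \]

and we assume that for all $j \in \{1, \ldots, b_m\}$,

\[ \max_{k \in \Pi_i} \mathrm{Card}(\Pi_{j|k}) \leq A_c \left( A_j A_i^{-1} \vee 1 \right) . \]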

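(P2) Upper bound on dimensions of models in $\mathcal{M}_n$: there exists a positive constant $A_{\mathcal{M},+}$ such that for every
$m \in \mathcal{M}_n$, $1 \leq D_m \leq \max\{ D_m, b_m^2 A_{b_m} \} \leq A_{\mathcal{M},+}\, n$.

(P3) Richness of $\mathcal{M}_n$: there exist $m_0, m_1 \in \mathcal{M}_n$ and some constants $c_{rich}, A_{rich} > 0$ such that $D_{m_0} \in
[\sqrt{n}, c_{rich} \sqrt{n}]$ and $D_{m_1} \geq A_{rich}\, n (\ln n)^{-2}$.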
(Ab) A positive constant $A$ exists, that bounds the data and the projections $s_m$ of the target $s_*$ over the
models $m$ of the collection $\mathcal{M}_n$: $|Y_i| \leq A < \infty$, $\|s_m\|_\infty \leq A < \infty$ for all $m \in \mathcal{M}_n$.

(An) Uniform lower-bound on the noise level: $\sigma(X_i) \geq \sigma_{\min} > 0$ a.s.

(Ap$_u$) The bias decreases as a power of $D_m$: there exist $\beta_+ > 0$ and $C_+ > 0$ such that

\[ \ell(s_*, s_m) \leq C_+ D_m^{-\beta_+} . \]

The set of Assumptions (SA) is very similar to, and actually extends, the set of assumptions used in [9]
and [37] to prove the validity of the slope heuristics in heteroscedastic least-squares regression, respectively for
models of histograms and piecewise polynomials.

The main features in this set of Assumptions (SA) are as follows. Assumption (P1) amounts to saying that we
select a model among a "small" collection, as opposed to large collections of models whose cardinality is exponential
with respect to the amount of data n. Roughly speaking, this assumption allows us to neglect the deviations of
the excess risks on each model around their mean, since the concentration inequalities shown in Section 5 below for
these quantities are exponential.

Then Assumptions (Auslb), (P2), (Ab) and (An) allow us to apply the desired concentration inequalities for
the excess risks established in Section 5. As shown in Section 2.2.2, Assumption (Auslb) covers in particular
the case of compactly supported wavelet expansions on the interval.

For further and more detailed comments on the above assumptions, we refer to [9] and [37]. 

3.3 Statement of the theorems 

Let us now state our results validating the slope heuristics for the selection of uniformly strongly localized bases.
The first theorem exhibits the empirical excess risk defined in (9) as a (majorant of the) minimal penalty, as
conjectured by Arlot and Massart [9].

Theorem 3.1 Take a positive penalty: for all $m \in \mathcal{M}_n$, $\mathrm{pen}(m) \geq 0$. Suppose that the assumptions (SA) of
Section 3.2 hold, and furthermore suppose that for $A_{pen} \in [0,1)$ and $A_p > 0$ the model $m_1$ of assumption (P3)
satisfies

\[ 0 \leq \mathrm{pen}(m_1) \leq A_{pen}\, \mathbb{E} \left[ \ell_{emp}(\hat{s}_{m_1}, s_{m_1}) \right] , \]

with probability at least $1 - A_p n^{-2}$. Then there exist a constant $L_1 > 0$ only depending on constants in (SA), as
well as an integer $n_0$ and a positive constant $L_2$ only depending on $A_{pen}$ and on constants in (SA) such that,
for all $n \geq n_0$, it holds with probability at least $1 - L_1 n^{-2}$,

\[ D_{\hat{m}} \geq L_2\, n \ln(n)^{-2} \]

and

\[ \ell(s_*, \hat{s}_{\hat{m}}) \geq \frac{ n^{\beta_+ / (1 + \beta_+)} }{ (\ln n)^3 } \inf_{m \in \mathcal{M}_n} \{ \ell(s_*, \hat{s}_m) \} , \]

where $\beta_+ > 0$ is defined in assumption (Ap$_u$) of (SA).

In order to theoretically validate the slope heuristics described in Section 3.1 above, it remains, in addition to
Theorem 3.1, to show that taking a penalty greater than the empirical excess risk ensures an oracle inequality
and that taking two times the empirical excess risk yields asymptotic optimality of the procedure. This is what
we present now.

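Theorem 3.2 Suppose that the assumptions (SA) of Section 3.2 hold, and furthermore suppose that for some
$\delta \in [0,1)$ and $A_p, A_r > 0$, there exists an event of probability at least $1 - A_p n^{-2}$ on which, for every model
$m \in \mathcal{M}_n$ such that $D_m \geq A_{\mathcal{M},+} (\ln n)^3$, it holds

\[ \left| \mathrm{pen}(m) - 2 \mathbb{E} \left[ \ell_{emp}(\hat{s}_m, s_m) \right] \right| \leq \delta \left( \ell(s_*, s_m) + \mathbb{E} \left[ \ell_{emp}(\hat{s}_m, s_m) \right] \right) \]

together with

\[ |\mathrm{pen}(m)| \leq A_r \left( \frac{ \ell(s_*, s_m) }{ (\ln n)^2 } + \frac{ (\ln n)^3 }{ n } \right) . \tag{10} \]

Then, for any $\eta \in (0, \beta_+ / (1 + \beta_+))$, there exist an integer $n_0$ only depending on $\eta$, $\delta$ and $\beta_+$ and on constants
in (SA), a positive constant $L_3$ only depending on $c_{\mathcal{M}}$ given in (SA) and on $A_p$, two positive constants $L_4$ and
$L_5$ only depending on constants in (SA) and on $A_r$ and a sequence

\[ \theta_n \leq \frac{ L_4 }{ (\ln n)^{1/4} } \]

such that it holds for all $n \geq n_0$, with probability at least $1 - L_3 n^{-2}$,

\[ D_{\hat{m}} \leq n^{\eta + 1/(1 + \beta_+)} \]

and

\[ \ell(s_*, \hat{s}_{\hat{m}}) \leq \left( \frac{1 + \delta}{1 - \delta} + \frac{5 \theta_n}{(1 - \delta)^2} \right) \inf_{m \in \mathcal{M}_n} \{ \ell(s_*, \hat{s}_m) \} + L_5 \frac{ (\ln n)^3 }{ n } . \]

Assume that in addition, the following assumption holds.

(Ap) The bias decreases like a power of $D_m$: there exist $\beta_- \geq \beta_+ > 0$ and $C_+, C_- > 0$ such that

\[ C_- D_m^{-\beta_-} \leq \ell(s_*, s_m) \leq C_+ D_m^{-\beta_+} . \]

Then it holds for all $n \geq n_0((SA), C_-, \beta_-, \beta_+, \eta, \delta)$, with probability at least $1 - L_3 n^{-2}$,

\[ A_{\mathcal{M},+} (\ln n)^3 \leq D_{\hat{m}} \leq n^{\eta + 1/(1 + \beta_+)} \]

and

\[ \ell(s_*, \hat{s}_{\hat{m}}) \leq \left( \frac{1 + \delta}{1 - \delta} + \frac{5 \theta_n}{(1 - \delta)^2} \right) \inf_{m \in \mathcal{M}_n} \{ \ell(s_*, \hat{s}_m) \} . \tag{11} \]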

Notice that taking $\delta = 0$ in Theorem 3.2 gives an oracle inequality (11) with leading constant equal to
$1 + 5\theta_n$ and thus converging to one when the amount of data tends to infinity. This shows the optimality of
the penalty equal to two times the minimal one, thus validating the slope heuristics for the selection of models
endowed with a strongly localized basis structure.

The proofs of Theorems 3.1 and 3.2 follow from [37] and Theorem 5.1 below (see Section 7.2 for more
details).


4 V-fold model selection 


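We need some further notations and we follow here the notations of Arlot [3]. In order to highlight the
dependence on the training set, we will write $\hat{s}_m(P_n)$ for the least-squares estimator learned from the empirical
distribution $P_n = n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$. For V-fold sampling, we choose some partition $(B_j)_{1 \leq j \leq V}$ of the index set
$\{1, \ldots, n\}$ and define

\[ P_n^{(j)} = \frac{1}{\mathrm{Card}(B_j)} \sum_{i \in B_j} \delta_{(X_i, Y_i)} \quad \text{and} \quad P_n^{(-j)} = \frac{1}{n - \mathrm{Card}(B_j)} \sum_{i \notin B_j} \delta_{(X_i, Y_i)} , \]

together with the estimators

\[ \hat{s}_m^{(-j)} = \hat{s}_m \left( P_n^{(-j)} \right) . \]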

4.1 Classical V-fold cross-validation 

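In the VFCV procedure, the selected model $\hat{m}_{VFCV}$ optimizes the classical V-fold criterion,

\[ \hat{m}_{VFCV} \in \arg\min_{m \in \mathcal{M}_n} \{ \mathrm{crit}_{VFCV}(m) \} , \tag{12} \]

where

\[ \mathrm{crit}_{VFCV}(m) = \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \left( \gamma \left( \hat{s}_m^{(-j)} \right) \right) . \tag{13} \]

We assume that the partition is regular in the sense that for all $j \in \{1, \ldots, V\}$, $\mathrm{Card}(B_j) = n/V$; in practice
we can always ensure that for all $j$, $|\mathrm{Card}(B_j) - n/V| < 1$.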
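The V-fold criterion (13) can be sketched as follows for regular histogram models on $[0,1]$. This is only an illustration under stated assumptions: the blocks $B_j$ are built by taking indices modulo $V$, empty cells predict zero, and the helper names are hypothetical.

```python
def fit_histogram(xs, ys, n_bins):
    """Least-squares estimator on the regular histogram model: cell-wise mean of Y."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Convention (assumption): predict 0 on cells that contain no data.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def risk(coeffs, xs, ys):
    """Empirical least-squares risk of a histogram estimator on the points (xs, ys)."""
    n_bins = len(coeffs)
    return sum((y - coeffs[min(int(x * n_bins), n_bins - 1)]) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def crit_vfcv(xs, ys, n_bins, V):
    """V-fold criterion: average over j of the risk on block B_j of the fit on the rest."""
    n = len(xs)
    total = 0.0
    for j in range(V):
        held = [i for i in range(n) if i % V == j]    # block B_j
        train = [i for i in range(n) if i % V != j]   # complement of B_j
        s_hat = fit_histogram([xs[i] for i in train], [ys[i] for i in train], n_bins)
        total += risk(s_hat, [xs[i] for i in held], [ys[i] for i in held])
    return total / V


def select_vfcv(xs, ys, candidate_bins, V):
    """Model (number of cells) minimizing the V-fold criterion."""
    return min(candidate_bins, key=lambda d: crit_vfcv(xs, ys, d, V))
```

On the four points `xs = [0.1, 0.2, 0.6, 0.7]`, `ys = [1.0, 1.0, 3.0, 3.0]` with `V = 2`, the criterion is zero for two cells and one for a single cell, so the two-cell model is selected.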
Theorem 4.1 Assume that (SA) holds. Let $r \in (2, +\infty)$ and $V \in \{2, \ldots, n - 1\}$ satisfying $1 < V \leq r$. Define
the VFCV procedure as the model selection procedure given by (12). Then, there exists a constant $L_{(SA),r} > 0$
such that for all $n \geq n_0((SA), r)$, with probability at least $1 - L_{(SA),r}\, n^{-2}$,


i Sr. 


,)< 



J(-1) 


)} 


+ L(^SA),r 


(Inn)^ 

n 


The proof of Theorem 4.1 can be found in Section 7.3. 

In Theorem 4.1, we show an oracle inequality with leading constant converging to one when the amount of data
tends to infinity, that compares the excess risk of the model selected via VFCV to the excess risk of the best
estimator learned with a fraction of the amount of data equal to $1 - V^{-1}$. Thus, VFCV asymptotically
recovers the oracle learned with a fraction $1 - V^{-1}$ of the data. This is natural since the
V-fold criterion given in (13) is an unbiased estimate of the risk of estimators learned with a fraction $1 - V^{-1}$
of the data.
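As an illustration of criterion (13), here is a minimal Python sketch. It is not the code used for the experiments of Section 6: regular histograms on [0, 1] stand in for the linear models, and the helper names (`fit_histogram`, `predict_histogram`, `crit_vfcv`, `select_vfcv`) as well as the convention of setting empty bins to 0 are assumptions made only for this example.

```python
import numpy as np

def fit_histogram(x, y, D):
    """Least-squares projection onto piecewise constants on D regular bins of [0, 1]."""
    bins = np.minimum((x * D).astype(int), D - 1)
    sums = np.bincount(bins, weights=y, minlength=D)
    counts = np.bincount(bins, minlength=D)
    # Empty bins get the value 0 (arbitrary convention for this sketch).
    return np.divide(sums, counts, out=np.zeros(D), where=counts > 0)

def predict_histogram(coef, x):
    """Evaluate the fitted piecewise-constant function at the points x."""
    D = coef.size
    return coef[np.minimum((x * D).astype(int), D - 1)]

def crit_vfcv(x, y, D, blocks):
    """V-fold criterion: average over j of P_n^{(j)} gamma(s_m^{(-j)}), least-squares contrast."""
    n = x.size
    total = 0.0
    for B in blocks:
        train = np.ones(n, dtype=bool)
        train[B] = False                      # learn without block B_j
        coef = fit_histogram(x[train], y[train], D)
        total += np.mean((y[B] - predict_histogram(coef, x[B])) ** 2)  # test on B_j
    return total / len(blocks)

def select_vfcv(x, y, dims, V, rng):
    """Return the dimension minimizing the V-fold criterion, and all criterion values."""
    blocks = np.array_split(rng.permutation(x.size), V)  # (almost) regular partition
    crits = [crit_vfcv(x, y, D, blocks) for D in dims]
    return dims[int(np.argmin(crits))], crits
```

Each block is used once as a test sample, and each estimator is learned from the remaining fraction $1 - V^{-1}$ of the data, which is the source of the bias discussed in the text.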

Consequently, it seems from Theorem 4.1 that there is some room to improve the performances of VFCV
for fixed V, since the oracle learned with all the data has better performances (smaller excess risk) than the
oracle learned with only part of the initial data. Furthermore, using the concentration inequalities derived in
Theorem 5.2, we roughly have, for any $m\in\mathcal{M}_n$,

\[
\begin{aligned}
\mathbb{E}\big[\ell\big(s_*, \hat s_m^{(-1)}\big)\big] &= \ell(s_*, s_m) + \mathbb{E}\big[\ell\big(s_m, \hat s_m^{(-1)}\big)\big] \\
&\approx \ell(s_*, s_m) + \frac{C_m}{(1 - V^{-1})\,n} \\
&\approx \ell(s_*, s_m) + \frac{V}{V-1}\,\mathbb{E}\big[\ell(s_m, \hat s_m)\big] \\
&\le \frac{V}{V-1}\,\mathbb{E}\big[\ell(s_*, \hat s_m)\big].
\end{aligned}
\]

The natural idea to overcome this issue is to try to select a model using an unbiased estimate of the risk of the
estimators $\hat s_m$ (rather than $\hat s_m^{(-1)}$ for VFCV). This is what we propose in the following section.


4.2 V-fold penalization 

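Let us consider the following penalization procedure, proposed by Arlot [3] and called V-fold penalization,

\[ \hat m_{\mathrm{penVF}} \in \arg\min_{m\in\mathcal{M}_n}\{\mathrm{crit}_{\mathrm{penVF}}(m)\}, \]

where

\[ \mathrm{crit}_{\mathrm{penVF}}(m) = P_n(\gamma(\hat s_m)) + \mathrm{pen}_{\mathrm{VF}}(m), \tag{14} \]

with

\[ \mathrm{pen}_{\mathrm{VF}}(m) = \frac{V-1}{V}\sum_{j=1}^{V}\Big[P_n\big(\gamma(\hat s_m^{(-j)})\big) - P_n^{(-j)}\big(\gamma(\hat s_m^{(-j)})\big)\Big]. \tag{15} \]

The idea behind V-fold penalization is to use the V-fold penalty $\mathrm{pen}_{\mathrm{VF}}$ as an unbiased estimate of the ideal
penalty $\mathrm{pen}_{\mathrm{id}}$, the latter allowing one to recover exactly the oracle $m_*$ defined in (1). Indeed, we can write

\[ m_* \in \arg\min_{m\in\mathcal{M}_n}\{P_n(\gamma(\hat s_m)) + \mathrm{pen}_{\mathrm{id}}(m)\}, \]

where

\[ \mathrm{pen}_{\mathrm{id}}(m) = P(\gamma(\hat s_m)) - P_n(\gamma(\hat s_m)). \tag{16} \]

Comparing (15) and (16), it is now clear that the V-fold penalty is a resampling estimate of the ideal penalty
where for each $j\in\{1,\ldots,V\}$ the role of $P$ is played by $P_n$ and the role of $P_n$ is played by $P_n^{(-j)}$.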
Now, the benefit compared to VFCV is that V-fold penalization is asymptotically optimal, as stated in the 
following theorem. 


Theorem 4.2 Assume that (SA) holds. Let $r\in(2,+\infty)$ and $V\in\{2,\ldots,n-1\}$ satisfying $1 < V \le r$. Define
the V-fold penalization procedure as the model selection procedure given in (14). Then, there exists a constant
$L_{(SA),r} > 0$ such that for all $n \ge n_0((SA),r)$, with probability at least $1 - L_{(SA),r}\,n^{-2}$,

\[ \ell\big(s_*, \hat s_{\hat m_{\mathrm{penVF}}}\big) \le \left(1 + \frac{L_{(SA),r}}{\sqrt{\ln n}}\right) \inf_{m\in\mathcal{M}_n}\{\ell(s_*, \hat s_m)\} + L_{(SA),r}\,\frac{(\ln n)^3}{n}. \]




















The proof of Theorem 4.2 can be found in Section 7.3.2. 

Theorem 4.2 exhibits an oracle inequality with leading constant converging to one, comparing the risk of the
model selected by V-fold penalization to the risk of an oracle model. This shows asymptotic optimality of
the procedure and extends previous optimality results, obtained by Arlot [3] for the selection of histograms, to
the selection of linear models endowed with a strongly localized basis structure, also in heteroscedastic
regression with random design.


5 Excess risks’ concentration 

We formulate in this section optimal upper and lower bounds that describe the concentration of the excess risks 
for a fixed parametric model, but with dimension depending on the sample size. In the case of the existence of 
a strongly localized basis, we prove optimal bounds for models with dimension roughly smaller than n (up to 
logarithmic factors). 

The proofs, which involve sophisticated arguments from empirical process theory, are partly based on earlier 
work by Saumard [36]. Furthermore, we use some representation formulas for functionals of M-estimators, 
which generalize previous excess risks representations exposed by Saumard [36], Chatterjee [18], Muro and van 
de Geer [33] and van de Geer and Wainwright [39]. We give these formulas in Section 5.2. 

5.1 Strongly localized bases case 

The following result of consistency in sup-norm for the least-squares estimator is a preliminary result that will 
be needed in the proof of our optimal concentration bounds. 

Theorem 5.1 Let $\alpha > 0$. Assume that $m$ is a linear vector space of finite dimension $D_m$ satisfying (Aslb)
and use notations of (Aslb). Assume moreover that the following assumption holds:

(Ab(m)) A positive constant $A$ exists, that bounds the data and the projection $s_m$ of the target $s_*$ on the model
$m$: $|Y_i| \le A < \infty$, $\|s_m\|_\infty \le A < \infty$.

If there exists $A_+ > 0$ such that

\[ \max\{D_m,\; b_m A_c^2\} \le A_+\,\frac{n}{(\ln n)^2}, \]

then there exists a positive constant $L_{A,r_m,\alpha}$ such that, for all $n \ge n_0(A_+,\alpha)$,

\[ \mathbb{P}\left(\|\hat s_m - s_m\|_\infty \ge L_{A,r_m,\alpha}\sqrt{\frac{D_m \ln n}{n}}\right) \le n^{-\alpha}. \tag{17} \]


Theorem 5.1 extends to the case of strongly localized bases previous results obtained in [36] for the consistency
in sup-norm of least-squares estimators on linear models of histograms and piecewise polynomials. Note that
minimax rates of convergence in sup-norm (and in more general $L_q$ norms, $1 \le q \le \infty$) for random design
regression have been obtained by Stone [38].

Theorem 5.1 is based on new formulas for functionals of M-estimators that are described in Section 5.2 
below. 


Remark 5.1 The main results of our paper are proved for models endowed with a strongly localized basis. In
fact, we can also prove some results for the slightly weaker and more classical assumption of localized basis,
defined in (2.1). The main difference is that with models having a strongly localized basis we can describe the
optimality of model selection procedures for the selection of models with dimension up to $n/(\ln n)^2$, whereas for
the localized basis case, we describe optimal results for models with dimension smaller than $n^{1/3}/(\ln n)^2$. This
is an issue for instance in the slope heuristics, where the two algorithms of detection of the minimal penalty are
based on the behavior of the largest models in the collection at hand. At a technical level, the essential gap is
that for models with localized bases, we are able to prove Inequality (17) in Theorem 5.1 only for models with linear
dimension $D_m \ll n^{1/3}$ (see Remark 7.1).

Let us now detail our concentration bounds for the excess risks. Theorem 5.2 below is a corollary of Theorem 
2 of [36] and Theorem 5.1 above. 








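Theorem 5.2 Let $A_+, A_-, \alpha > 0$. Assume that $m$ is a linear vector space of finite dimension $D_m$ satisfying
(Aslb) and use notations of (Aslb). Assume moreover that Assumption (Ab(m)) defined in Theorem 5.1 holds.
If we have

\[ A_-(\ln n)^2 \le D_m \le \max\{D_m,\; b_m A_c^2\} \le A_+\,\frac{n}{(\ln n)^2}, \]

then a positive constant $L_0$ exists, only depending on $\alpha$, $A_-$ and on the constants $A$, $\sigma_{\min}$ and $r_m$, such that by
setting

\[ \varepsilon_n = L_0 \max\left\{\left(\frac{\ln n}{D_m}\right)^{1/4},\; \left(\frac{D_m \ln n}{n}\right)^{1/4}\right\}, \tag{18} \]

we have for all $n \ge n_0(A_-, A_+, A, r_m, \sigma_{\min}, \alpha)$,

\[ \mathbb{P}\left((1-\varepsilon_n)\frac{C_m}{n} \le \ell(s_m, \hat s_m) \le (1+\varepsilon_n)\frac{C_m}{n}\right) \ge 1 - 10\,n^{-\alpha}, \tag{19} \]

\[ \mathbb{P}\left((1-\varepsilon_n^2)\frac{C_m}{n} \le \ell_{\mathrm{emp}}(\hat s_m, s_m) \le (1+\varepsilon_n^2)\frac{C_m}{n}\right) \ge 1 - 5\,n^{-\alpha}, \tag{20} \]

where $C_m = \sum_{k=1}^{D_m} \mathrm{var}\big((Y - s_m(X))\cdot\varphi_k(X)\big)$.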


Theorem 5.2 exhibits the concentration of the excess risk and the empirical excess risk around the same
value equal to $n^{-1}C_m$. Furthermore, it is easy to check that the term $C_m$ is of the order of the linear dimension
$D_m$. More precisely, it satisfies

\[ 0 < \sigma_{\min}^2 D_m \le C_m \le (3A)^2 D_m. \]


See [36], Section 4.3 for the details, noticing that with the notations of [36], it holds $C_m = D_m K_{1,m}^2/4$. It
is also worth noticing that the empirical excess risk concentrates better than the true excess risk, the rate of
concentration for the empirical excess risk, given by the term $\varepsilon_n^2$, being the square of the concentration rate
$\varepsilon_n$ of the excess risk. This will be explained at a heuristic level in Section 5.2 using representation formulas for
the excess risks in terms of the empirical process.

Compared to other concentration results established in [18], [33] and [39] for the excess risk of least-squares
or more general M-estimators, Inequalities (19) and (20) share the strong feature of computing the exact
concentration point, which is equal to $n^{-1}C_m$. On the contrary, the methodology built by Chatterjee [18] and
extended in [33] and [39] gives the concentration of the excess risk around a point, but says nothing about the
value of this point. We explain this important aspect further in Section 5.2 below.


5.2 Representation formulas for functionals of M-estimators 

In this section only, we assume that the contrast $\gamma$ defining the estimator $\hat s_m$ is general, so that $\hat s_m$ is a general
M-estimator (assumed to exist) on a model $m$,

\[ \hat s_m \in \arg\min_{s\in m}\{P_n(\gamma(s))\} = \arg\min_{s\in m}\left\{\frac{1}{n}\sum_{i=1}^{n}\gamma(s)(Z_i)\right\}, \]

where $(Z_1,\ldots,Z_n)\in\mathcal{Z}^n$ is a sample of random variables living in some general measurable space $\mathcal{Z}$.

Define $\mathcal{F}$ a nonnegative functional from $m$ to $\mathbb{R}_+$: $\forall s\in m$, $\mathcal{F}(s) \ge 0$. Then the following representation of
$\mathcal{F}(\hat s_m)$ in terms of local extrema of the empirical process of interest holds.

Proposition 5.3 With the notations above, let us also write $m_C$ (resp. $d_C$), $C \ge 0$, the subset of the model $m$
such that the values of the functional $\mathcal{F}$ on this subset are bounded above by (resp. equal to) $C$:

\[ m_C = \{s\in m \,;\, \mathcal{F}(s) \le C\} \quad\text{and}\quad d_C = \{s\in m \,;\, \mathcal{F}(s) = C\}. \]

Then,

\[ \mathcal{F}(\hat s_m) \in \arg\min_{C\ge 0}\left\{\inf_{s\in d_C} P_n(\gamma(s))\right\} \tag{21} \]

and

\[ \mathcal{F}(\hat s_m) \in \arg\min_{C\ge 0}\left\{\inf_{s\in m_C} P_n(\gamma(s))\right\}. \tag{22} \]

Proposition 5.3, whose proof is simple and written in Section 7.4.2, casts the problem of bounding any
functional of an M-estimator into an empirical process question, consisting of comparing local extrema of the
empirical measure taken on the contrasted functions of the model. To our knowledge, such a result is
new.

Considering the particular case of the sup-norm, formula (21) is our starting point to prove Theorem 5.1.
More precisely, we use the fact that taking

\[ \mathcal{F}(\hat s_m) = \|\hat s_m - s_m\|_\infty, \]

formula (21) directly implies that for any $C \ge 0$,

\[ \mathbb{P}\big(\|\hat s_m - s_m\|_\infty > C\big) \le \mathbb{P}\left(\inf_{s\in m\setminus m_C} P_n(\gamma(s)) \le \inf_{s\in m_C} P_n(\gamma(s))\right). \]

See Section 7 for the complete proofs. 
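For the reader's convenience, the elementary step behind the last inequality can be written out as follows. This is only a short worked argument in our own notation, where $\mathcal{F}(s) = \|s - s_m\|_\infty$ and $m_C = \{s\in m \,;\, \mathcal{F}(s)\le C\}$; it is not a substitute for the complete proofs.

```latex
\begin{align*}
\mathcal{F}(\hat s_m) > C
  &\;\Longrightarrow\; \hat s_m \in m\setminus m_C \\
  &\;\Longrightarrow\; \inf_{s\in m\setminus m_C} P_n(\gamma(s))
     \;\le\; P_n(\gamma(\hat s_m))
     \;=\; \inf_{s\in m} P_n(\gamma(s))
     \;\le\; \inf_{s\in m_C} P_n(\gamma(s)).
\end{align*}
% Hence the event {F(\hat s_m) > C} is contained in the event
% { inf over m \ m_C of P_n(gamma(s))  <=  inf over m_C of P_n(gamma(s)) },
% and taking probabilities gives the stated bound.
```

The only property of $\hat s_m$ used here is that it minimizes the empirical risk over the whole model $m$.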

Another interesting application of Proposition 5.3 would be to derive bounds for the $L_p$, $p \ge 1$, moments (or
more general Orlicz norms) of an M-estimator. We postpone this question to future work.

Remark 5.4 Nonnegativity of $\mathcal{F}$ is not essential (but suitable to our needs) and considering functionals with
negative values is also possible, with straightforward adaptations of the formulas of Proposition 5.3.

Taking $\mathcal{F}$ to be the true or the empirical excess risk on $m$, we get the following results, refining the
representation formulas previously obtained by [36] (see Remark 1 of Section 3 therein).


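Proposition 5.5 With the notations above, let also $\mathcal{G}$ be a nonnegative functional on $m$ and $R_0\in\mathbb{R}_+\cup\{+\infty\}$.
If the following event holds $\{\mathcal{G}(\hat s_m) \le R_0\}$ (the case $R_0 = +\infty$ corresponds to the trivial total event), then by
setting

\[ \tilde m_C = \{s\in m \,;\, \ell(s_m, s) \le C \ \&\ \mathcal{G}(s) \le R_0\} \quad\text{and}\quad \tilde d_C = \{s\in m \,;\, \ell(s_m, s) = C \ \&\ \mathcal{G}(s) \le R_0\}, \]

it holds

\[ \ell(s_m, \hat s_m) \in \arg\max_{C\ge 0}\left\{\sup_{s\in\tilde d_C}\{(P_n - P)(\gamma(s_m) - \gamma(s))\} - C\right\}, \tag{23} \]

\[ \ell(s_m, \hat s_m) \in \arg\max_{C\ge 0}\left\{\sup_{s\in\tilde m_C}\{(P_n - P)(\gamma(s_m) - \gamma(s))\} - C\right\}, \tag{24} \]

and

\[ \ell_{\mathrm{emp}}(\hat s_m, s_m) = \max_{C\ge 0}\left\{\sup_{s\in\tilde d_C}\{(P_n - P)(\gamma(s_m) - \gamma(s))\} - C\right\}, \tag{25} \]

\[ \ell_{\mathrm{emp}}(\hat s_m, s_m) = \max_{C\ge 0}\left\{\sup_{s\in\tilde m_C}\{(P_n - P)(\gamma(s_m) - \gamma(s))\} - C\right\}. \tag{26} \]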


The same type of excess risk representation as the one obtained in (24) is at the core of the approach to
excess risk concentration recently developed by Chatterjee [18], Muro and van de Geer [33] and van de Geer
and Wainwright [39]. The main difference with our approach is that these authors rather use the parametrization
$t = \sqrt{C}$ and take advantage of an argument of concavity with respect to $t$ of the supremum of the empirical
process on "balls" of excess risk smaller than $t^2$. We refer to van de Geer and Wainwright [39] for more
details about this concavity argument (called "second order margin condition" by these authors). But with this
concavity argument, nothing can be said a priori about the point around which the excess risk concentrates. To
obtain optimal bounds on this point, as in Theorem 5.2 above, we rather apply a technology developed in [36]
and based on the least-squares contrast expansion around the projection $s_m$ of the target. We refer to Section
3 of [36] for a detailed presentation of the latter approach.

Proposition 5.5 also makes transparent the fact that the empirical excess risk has better concentration
rates (given by the term $\varepsilon_n^2$ in Theorem 5.2) than the excess risk, which concentrates at the rate $\varepsilon_n$. Indeed,
if we set

\[ r_n(C) := \sup_{s\in\tilde d_C}\{(P_n - P)(\gamma(s_m) - \gamma(s))\} - C, \]

with $\{\mathcal{G}(\hat s_m) \le R_0\} = \{\|\hat s_m - s_m\|_\infty \le R_0\}$, the proof of Theorem 5.2 shows that $r_n(C)$ concentrates
around the quantity $2\sqrt{n^{-1}C_m C} - C$, which is parabolic around its maximum. The conclusion can now be
directly read in Figure 2.
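To make this heuristic explicit, the following short computation (a sketch in our own notation, with $a^2 := C_m/n$ and a uniform relative deviation $\eta$ introduced only for this illustration) shows why the maximum of a function that is parabolic near its maximizer is less sensitive to perturbations than the maximizer itself.

```latex
\begin{align*}
f(C) &= 2a\sqrt{C} - C, \qquad a^2 = \frac{C_m}{n},\\
f'(C) &= \frac{a}{\sqrt{C}} - 1 = 0 \;\Longleftrightarrow\; C^\ast = a^2 = \frac{C_m}{n},
\qquad f(C^\ast) = 2a^2 - a^2 = \frac{C_m}{n},\\
f''(C^\ast) &= -\frac{a}{2}\,(C^\ast)^{-3/2} = -\frac{1}{2a^2},
\qquad f(C) \approx a^2 - \frac{(C - a^2)^2}{4a^2}\ \text{ near } C^\ast.
\end{align*}
% Assume (for illustration only) that |r_n(C) - f(C)| <= eta * a^2 uniformly near C*.
% Maximum:   | max_C r_n(C) - a^2 | <= eta * a^2,                 relative error eta.
% Maximizer: f(C*) - f(\hat C) <= 2 * eta * a^2 for any maximizer \hat C of r_n, so
%            (\hat C - a^2)^2 <~ 8 * eta * a^4,  i.e.  |\hat C - a^2| <~ sqrt(8 * eta) * a^2,
%            relative error of order sqrt(eta).
```

With $\eta$ of order $\varepsilon_n^2$, the maximum (the empirical excess risk) is controlled at the relative rate $\varepsilon_n^2$ and the maximizer (the true excess risk) at the relative rate $\varepsilon_n$, in agreement with (19) and (20).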
Figure 2: The true and empirical excess risks are given respectively as the maximizer and the maximum of the
same function $r_n$. If $r_n$ is regular around its maximum, this explains why the concentration rate for the empirical
excess risk, given by $\varepsilon_n^2$, is better than for the true excess risk, given by $\varepsilon_n$.

6 Numerical experiments 

A simulation study was conducted in order to compare the numerical performances of the model selection
procedures we have discussed. We consider wavelet models as nontrivial illustrative examples of the theory
developed above for the selection of linear estimators using the slope heuristics and V-fold model selection.
However, this is a rather different question from designing the best possible estimators using wavelet expansions,
since these estimators are likely to be nonlinear, as for the thresholding strategies (see e.g., [1] for a comparative
simulation study of wavelet based estimators). Although a linear wavelet estimator is not as flexible, or
potentially as powerful, as a nonlinear one, it still preserves the computational benefits of wavelet methods.
See e.g., [2], which is a key reference for linear wavelet methods in nonparametric regression. All simulations
have been conducted with Matlab and the wavelet toolbox Wavelab850 [20] that is freely available from
http://statweb.stanford.edu/~wavelab/. In order to reproduce all the experiments, the codes used to generate
the numerical results presented in this paper will be available online at https://github.com/fabnavarro.

6.1 Computational aspects 

For sample sizes n = 256, 1024, 4096, data were generated according to

\[ Y_i = s_*(X_i) + \sigma(X_i)\,\varepsilon_i, \quad i = 1,\ldots,n, \]

where the $X_i$'s are uniformly distributed on [0,1], and the $\varepsilon_i$'s are independent $\mathcal{N}(0,1)$ variables, independent of the $X_i$'s.
In the case of fixed design, thanks to Mallat's pyramid algorithm (see [30]), the computation of wavelet-based
estimators is straightforward and fast. In the case where the function $s_*$ is observed on a random grid, the
implementation requires some extra precautions and several strategies have been proposed in the literature (see
e.g. [15, 24]). In the context of random uniform design regression estimation, [16] have examined convergence
rates when the unknown function is in a Hölder class. They showed that the standard equispaced wavelet method
with universal thresholding can be directly applied to the nonequispaced data (without a loss in the rate of
convergence). In this simulation study, we have adopted this approach, since it preserves the computational
simplicity and efficiency of the equispaced algorithm. The same choice was made in the context of wavelet
regression in random design with heteroscedastic dependent errors by [26]. Thus, in this case, the collection of
models is computed by a simple application of Mallat's algorithm using the ordered $X_i$'s as input variables.

6.2 Examples 

Four standard regression functions representing different levels of spatial variability (Wave, HeaviSine, Doppler
and Spikes, see [21, 31, 14]) and the following four $\sigma(\cdot)$ scenarios were considered:

(a) Low Homoscedastic Noise: $\sigma_{l1}(x) = 0.01$;

(b) Low Heteroscedastic Noise: $\sigma_{l2}(x) = 0.02x$;

(c) High Homoscedastic Noise: $\sigma_{h1}(x) = 0.05$;

(d) High Heteroscedastic Noise: $\sigma_{h2}(x) = 0.1x$.
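The data-generating mechanism and the four noise scenarios can be sketched as follows in Python (the experiments of the paper were run in Matlab). The dictionary `SIGMAS` and the function `simulate` are names introduced for this illustration. The `heavisine` function uses the classical Donoho and Johnstone formula as an assumption; the exact scaling of the test functions used in the experiments may differ.

```python
import numpy as np

# The four noise-level functions sigma(.) of scenarios (a)-(d).
SIGMAS = {
    "l1": lambda x: np.full_like(x, 0.01),  # low homoscedastic
    "l2": lambda x: 0.02 * x,               # low heteroscedastic
    "h1": lambda x: np.full_like(x, 0.05),  # high homoscedastic
    "h2": lambda x: 0.1 * x,                # high heteroscedastic
}

def heavisine(x):
    """Classical HeaviSine test function (assumed form; scaling in the paper may differ)."""
    return 4.0 * np.sin(4.0 * np.pi * x) - np.sign(x - 0.3) - np.sign(0.72 - x)

def simulate(n, s_star, sigma, rng):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = s_*(X_i) + sigma(X_i) * eps_i.

    The design points are returned sorted, since the ordered X_i's are the
    input of the equispaced wavelet algorithm described in the text.
    """
    x = np.sort(rng.uniform(0.0, 1.0, n))
    eps = rng.standard_normal(n)
    y = s_star(x) + sigma(x) * eps
    return x, y
```

For instance, `simulate(4096, heavisine, SIGMAS["h2"], np.random.default_rng(0))` draws one sample of size 4096 in the high heteroscedastic noise scenario.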
Figure 3: (a)-(d): The four test functions used in the simulation study, sampled at 4096 points (panels (c) and (d) show Doppler and Spikes). (e): Wavelet
coefficients of the test functions.


The test functions are plotted in Figure 3 and a visual idea of the four noise levels is given in Figures 4(a)-(d). 
Several different wavelets were used. In the following, we only report in detail the results for Daubechies’ 
compactly supported wavelet with 8 vanishing moments. 

6.3 Four model selection procedures 

The performances of the following four model selection methods were compared:

• The slope heuristics (SH):

\[ \hat m_{\mathrm{SH}} \in \arg\min_{m\in\mathcal{M}_n}\{\mathrm{crit}_{\mathrm{SH}}(m)\}, \]

with

\[ \mathrm{crit}_{\mathrm{SH}}(m) = P_n(\gamma(\hat s_m)) + \mathrm{pen}_{\mathrm{SH}}(m), \]

and

\[ \mathrm{pen}_{\mathrm{SH}}(m) = 2\,S_{\min}\,\frac{D_m}{n}, \]

where $S_{\min}$ is obtained from the dimension jump method (see Figure 1). Practical issues about SH are
addressed in [10] and our implementation is based on the Matlab package CAPUSHE.

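• Mallows' Cp (Cp):

\[ \hat m_{C_p} \in \arg\min_{m\in\mathcal{M}_n}\{\mathrm{crit}_{C_p}(m)\}, \]

with

\[ \mathrm{crit}_{C_p}(m) = P_n(\gamma(\hat s_m)) + \mathrm{pen}_{C_p}(m), \]

and

\[ \mathrm{pen}_{C_p}(m) = 2\,\hat\sigma^2\,\frac{D_m}{n}, \]

where $\hat\sigma^2$ is globally estimated by the classical variance estimator defined as

\[ \hat\sigma^2 = \frac{d^2\big(Y_{1\ldots n},\, m_{n/2}\big)}{n - n/2}, \]

where $Y_{1\ldots n} = (Y_i)_{1\le i\le n}\in\mathbb{R}^n$, $m_{n/2}$ is the largest model of dimension $n/2$, and $d$ is the Euclidean distance
on $\mathbb{R}^n$.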
Figure 4: (a)-(d): Noisy version of Spikes for each of the $\sigma(\cdot)$ scenarios. (e): Typical reconstructions from a single
simulation with n = 4096. The dotted line is the true signal and the solid one depicts the estimates.
(f): Graph of the excess risk against the dimension $D_m$ and (shifted) $\mathrm{crit}_{\mathrm{SH}}(m)$ (in a log-log scale).
The gray circle represents the global minimizer $\hat m$ of $\mathrm{crit}_{\mathrm{SH}}(m)$ and the black star the oracle model $m_*$. (g):
Noisy and selected (black) wavelet coefficients (see Figure 3(e) for a visual comparison with the original wavelet
coefficients).


• Nason's 2-fold cross-validation (2FCV). Nason adjusted the usual 2FCV method, which cannot be applied
directly to wavelet estimation, for choosing the threshold parameter in wavelet shrinkage [34]. Adapting
his strategy to our context, we test, for every model of the collection, an interpolated wavelet estimator
learned from the (ordered) even-indexed data against the odd-indexed data and vice versa. More precisely,
considering that the data $X_i$ are ordered, the selected model $\hat m_{\mathrm{2FCV}}$ is obtained by minimizing (13) with V = 2,
$B_1 = \{2,4,\ldots,n\}$ and $B_2 = \{1,3,\ldots,n-1\}$.

• A penalized version of Nason's 2-fold cross-validation (pen2F). As for the 2FCV, we compute $\hat m_{\mathrm{pen2F}}$ by
minimizing (14) with V = 2, $B_1 = \{2,4,\ldots,n\}$ and $B_2 = \{1,3,\ldots,n-1\}$.
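The dimension jump calibration mentioned for the slope heuristics can be sketched as follows. This is a generic illustration in Python, assuming a linear penalty shape $D_m/n$ and a user-supplied grid of candidate constants; it is not the CAPUSHE implementation used for the experiments, and the function name `dimension_jump` is hypothetical.

```python
import numpy as np

def dimension_jump(emp_risk, dims, n, kappas):
    """Calibrate a penalty kappa * D_m / n by the largest jump in selected dimension.

    emp_risk[k] is the empirical risk P_n gamma(s_m) of the model of dimension dims[k];
    kappas is an increasing grid of candidate constants.
    Returns the estimated minimal constant and the index of the model selected
    with twice this constant.
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    dims = np.asarray(dims, dtype=float)
    # Dimension selected for each candidate constant on the grid.
    selected = np.array([dims[np.argmin(emp_risk + k * dims / n)] for k in kappas])
    # Largest drop of the selected dimension between consecutive grid points.
    jumps = selected[:-1] - selected[1:]
    i = int(np.argmax(jumps))
    kappa_min = kappas[i + 1]          # first grid value after the jump
    # Final selection with the optimal penalty, taken as twice the minimal one.
    final = int(np.argmin(emp_risk + 2.0 * kappa_min * dims / n))
    return kappa_min, final
```

Below the minimal constant the largest models are selected; just above it the selected dimension drops sharply, which is the change of behavior exploited by the method.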

For each method, the model collection described in Section 2.2.2 is constructed by adding successively whole
resolution levels of wavelet coefficients. Thus, the considered dimensions are $\{D_m \,;\, m\in\mathcal{M}_n\} = \{2^j,\ j =
1,\ldots,\log_2(n) - 1\}$. Note that unlike the local behaviours of the nonlinear models (e.g. thresholding), these linear
models operate in a global fashion since entire scale levels of coefficients are suppressed (see Figures 5(g), 4(g)
for an illustration).
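As a small sanity check of the objects just described, the dyadic model dimensions and the even/odd blocks of the 2-fold procedures can be generated as follows. This is an illustrative Python sketch assuming that n is a power of two; `model_dimensions` and `even_odd_blocks` are names chosen for this example.

```python
import numpy as np

def model_dimensions(n):
    """Dimensions 2^j, j = 1, ..., log2(n) - 1, for n a power of two."""
    J = n.bit_length() - 1            # equals log2(n) when n is a power of two
    return [2 ** j for j in range(1, J)]

def even_odd_blocks(n):
    """Blocks B_1 = {2, 4, ..., n} and B_2 = {1, 3, ..., n - 1} (1-based indices)."""
    idx = np.arange(1, n + 1)
    return idx[idx % 2 == 0], idx[idx % 2 == 1]
```

For n = 256 this gives the seven dimensions 2, 4, ..., 128, and two blocks of 128 indices each.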

Typical estimations from a single simulation with n = 4096 are depicted in Figure 4(e) for the Spikes function.
Figure 4(f) also contains a plot of the excess risk $\ell(s_*, \hat s_m)$ against the dimension $D_m$, and a vertical shift of
the curve $\mathrm{crit}_{\mathrm{SH}}(m)$ is also overlaid for visualization purposes. It can be observed that $\mathrm{crit}_{\mathrm{SH}}(m)$ gives a very
reliable estimate of the risk $\ell(s_*, \hat s_m)$, and in turn, also a high-quality estimate of the optimal model. Indeed,
for all cases, SH consistently selects the best model.


6.4 Model selection performances 


We compared the procedures on N = 1000 independent data sets of size n ranging from 256 to 4096. As in
Arlot [3], we estimate the quality of the model-selection strategies through the following constant

\[ C_{\mathrm{or}} = \mathbb{E}\left[\frac{\|s_* - \hat s_{\hat m}\|_2^2}{\inf_{m\in\mathcal{M}_n}\|s_* - \hat s_m\|_2^2}\right], \]

which represents the constant that would appear in front of an oracle inequality. This ratio, which is greater
than 1, represents the accuracy of the model selection procedure. The averages of $C_{\mathrm{or}}$ over 1000 replications are
given in Tables 1 and 2.
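The Monte Carlo estimate of this constant, together with the standard error reported in the tables, can be computed as in the following Python sketch. The function name `oracle_ratio` and the use of the unbiased (`ddof=1`) standard deviation are assumptions of this example.

```python
import numpy as np

def oracle_ratio(excess_risks, selected):
    """Monte Carlo estimate of C_or and its standard error.

    excess_risks: array of shape (N, M), excess risk of each of the M models
                  on each of the N replications.
    selected:     array of shape (N,), index of the model chosen by the
                  procedure on each replication.
    """
    excess_risks = np.asarray(excess_risks, dtype=float)
    selected = np.asarray(selected, dtype=int)
    chosen = excess_risks[np.arange(len(selected)), selected]
    ratios = chosen / excess_risks.min(axis=1)        # each ratio is >= 1
    mean = ratios.mean()
    stderr = ratios.std(ddof=1) / np.sqrt(len(ratios))  # empirical sd divided by sqrt(N)
    return mean, stderr
```

A value close to 1 means that the procedure performs almost as well as the oracle on the simulated data sets.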


6.5 Results and discussion 


It can be seen from Tables 1 and 2 that none of the methods clearly outperforms the others in all cases.
However, in our experiments, Mallows' Cp seems to perform slightly better in many situations, both in the low
and high noise regimes and for both homoscedastic and heteroscedastic noise. Also, the slope heuristics has
roughly comparable results with Mallows' Cp, except for the small sample size case n = 256, where Mallows'
Cp performs better, especially in the low noise regime. The rather poor behavior of the slope heuristics in the
latter case (low noise, small sample size) can be explained by the fact that in such a situation, the oracle model
is the largest model, which the slope heuristics tries to avoid through the use of the dimension jump.

In the low noise regime (Table 1), 2-fold penalization is slightly better than 2-fold cross-validation, especially
when the sample size is small (n = 256). Moreover, 2-fold penalization is competitive with Mallows' Cp in the
low noise regime. When the noise is high (Table 2), 2FCV and pen2F give roughly equivalent results.

Finally, it seems surprising that Mallows' Cp and the slope heuristics, which are based on linear penalties,
outperform cross-validation methods in the heteroscedastic noise case. Indeed, linear penalties are proved to
be asymptotically suboptimal in such a case, see Arlot [4], while we proved in Theorem 4.2 that V-fold
penalization for a fixed V is asymptotically optimal. However, in order to be able to use Mallat's algorithm for the
discrete wavelet transform, we restricted ourselves to the 2-fold case and this could be the reason for the rather
mild performances of the cross-validation techniques compared to Mallows' Cp. Indeed, it is well-known that in
general, it is better to take V = 5 or 10 instead of 2 (see for instance [6]), because it reduces the variance of the
cross-validation criterion. Also, Nason's cross-validation for wavelet models allows one to use Mallat's algorithm,
but at the price of an approximation of the original cross-validation criterion. These two aspects might be at
the origin of the superiority of Mallows' Cp over the cross-validation techniques, at least in the heteroscedastic
case.
s_*         σ    n      SH               Cp               2FCV             pen2F
Wave        l1   256    1.980 ± 0.011    1.106 ± 0.008    1.406 ± 0.019    1.034 ± 0.005
Wave        l1   1024   1.051 ± 0.002    1.031 ± 0.002    1.062 ± 0.002    1.056 ± 0.004
Wave        l1   4096   1.021 ± 0.001    1.021 ± 0.001    1.055 ± 0.002    1.021 ± 0.001
Wave        l2   256    1.799 ± 0.009    1.140 ± 0.008    1.341 ± 0.015    1.042 ± 0.005
Wave        l2   1024   1.021 ± 0.002    1.027 ± 0.002    1.029 ± 0.003    1.084 ± 0.006
Wave        l2   4096   1.033 ± 0.002    1.032 ± 0.002    1.015 ± 0.001    1.039 ± 0.002
HeaviSine   l1   256    1.482 ± 0.014    1.157 ± 0.005    1.437 ± 0.016    1.084 ± 0.006
HeaviSine   l1   1024   1.065 ± 0.003    1.023 ± 0.002    1.155 ± 0.006    1.062 ± 0.004
HeaviSine   l1   4096   1.011 ± 0.001    1.008 ± 0.001    1.101 ± 0.004    1.010 ± 0.001
HeaviSine   l2   256    1.357 ± 0.012    1.122 ± 0.005    1.357 ± 0.013    1.063 ± 0.004
HeaviSine   l2   1024   1.048 ± 0.003    1.032 ± 0.002    1.133 ± 0.006    1.093 ± 0.006
HeaviSine   l2   4096   1.016 ± 0.001    1.013 ± 0.001    1.064 ± 0.003    1.020 ± 0.001
Doppler     l1   256    2.890 ± 0.039    1.106 ± 0.008    1.852 ± 0.038    1.072 ± 0.008
Doppler     l1   1024   2.091 ± 0.015    1.064 ± 0.006    1.486 ± 0.022    1.013 ± 0.003
Doppler     l1   4096   1.010 ± 0.001    1.000 ± 0.000    1.141 ± 0.007    1.025 ± 0.003
Doppler     l2   256    2.820 ± 0.040    1.127 ± 0.009    1.784 ± 0.036    1.059 ± 0.006
Doppler     l2   1024   1.874 ± 0.013    1.078 ± 0.006    1.419 ± 0.016    1.009 ± 0.002
Doppler     l2   4096   1.024 ± 0.002    1.002 ± 0.000    1.187 ± 0.006    1.019 ± 0.003
Spikes      l1   256    3.541 ± 0.071    1.092 ± 0.007    2.075 ± 0.062    1.062 ± 0.010
Spikes      l1   1024   1.077 ± 0.006    1.021 ± 0.002    1.198 ± 0.012    1.045 ± 0.003
Spikes      l1   4096   1.008 ± 0.001    1.008 ± 0.001    1.029 ± 0.002    1.014 ± 0.001
Spikes      l2   256    3.236 ± 0.058    1.087 ± 0.007    2.008 ± 0.055    1.071 ± 0.011
Spikes      l2   1024   1.054 ± 0.004    1.013 ± 0.001    1.187 ± 0.012    1.069 ± 0.004
Spikes      l2   4096   1.007 ± 0.001    1.007 ± 0.001    1.009 ± 0.001    1.019 ± 0.002

Table 1: Comparison of mean performance $C_{\mathrm{or}}$ for each procedure over N = 1000 realizations of the low noise
level setting, with corresponding empirical standard deviation divided by $\sqrt{N}$.



s_*         σ    n      SH               Cp               2FCV             pen2F
Wave        h1   256    1.029 ± 0.004    1.016 ± 0.003    1.236 ± 0.011    1.158 ± 0.009
Wave        h1   1024   1.003 ± 0.001    1.002 ± 0.001    1.002 ± 0.001    1.033 ± 0.005
Wave        h1   4096   1.011 ± 0.002    1.008 ± 0.002    1.000 ± 0.000    1.040 ± 0.004
Wave        h2   256    1.076 ± 0.006    1.052 ± 0.006    1.252 ± 0.010    1.244 ± 0.012
Wave        h2   1024   1.022 ± 0.005    1.014 ± 0.004    1.004 ± 0.002    1.072 ± 0.008
Wave        h2   4096   1.020 ± 0.004    1.019 ± 0.004    1.006 ± 0.002    1.067 ± 0.007
HeaviSine   h1   256    1.096 ± 0.005    1.090 ± 0.005    1.115 ± 0.006    1.185 ± 0.013
HeaviSine   h1   1024   1.057 ± 0.003    1.054 ± 0.003    1.123 ± 0.006    1.075 ± 0.004
HeaviSine   h1   4096   1.029 ± 0.002    1.028 ± 0.002    1.081 ± 0.004    1.041 ± 0.003
HeaviSine   h2   256    1.155 ± 0.009    1.153 ± 0.011    1.125 ± 0.008    1.300 ± 0.020
HeaviSine   h2   1024   1.101 ± 0.006    1.091 ± 0.006    1.133 ± 0.007    1.159 ± 0.010
HeaviSine   h2   4096   1.047 ± 0.003    1.046 ± 0.003    1.122 ± 0.006    1.083 ± 0.005
Doppler     h1   256    1.330 ± 0.011    1.107 ± 0.005    1.347 ± 0.013    1.043 ± 0.003
Doppler     h1   1024   1.054 ± 0.003    1.025 ± 0.002    1.108 ± 0.005    1.067 ± 0.005
Doppler     h1   4096   1.013 ± 0.001    1.014 ± 0.001    1.029 ± 0.002    1.021 ± 0.001
Doppler     h2   256    1.224 ± 0.010    1.076 ± 0.004    1.291 ± 0.011    1.053 ± 0.003
Doppler     h2   1024   1.035 ± 0.002    1.031 ± 0.002    1.079 ± 0.004    1.098 ± 0.007
Doppler     h2   4096   1.010 ± 0.001    1.009 ± 0.001    1.022 ± 0.003    1.023 ± 0.002
Spikes      h1   256    1.156 ± 0.009    1.047 ± 0.003    1.282 ± 0.014    1.076 ± 0.005
Spikes      h1   1024   1.006 ± 0.001    1.005 ± 0.001    1.094 ± 0.007    1.029 ± 0.004
Spikes      h1   4096   1.012 ± 0.002    1.010 ± 0.001    1.009 ± 0.002    1.021 ± 0.002
Spikes      h2   256    1.119 ± 0.008    1.052 ± 0.004    1.284 ± 0.014    1.126 ± 0.006
Spikes      h2   1024   1.015 ± 0.002    1.014 ± 0.002    1.137 ± 0.008    1.059 ± 0.008
Spikes      h2   4096   1.015 ± 0.002    1.011 ± 0.002    1.014 ± 0.003    1.030 ± 0.004

Table 2: Comparison of mean performance $C_{\mathrm{or}}$ for each procedure over N = 1000 realizations of the high noise
level setting, with corresponding empirical standard deviation divided by $\sqrt{N}$.
7 Proofs


7.1 Proofs related to Section 2 


Proof of Proposition 2.1. The proof simply follows from the following computations. For every /3 = 


Dm 

< 

fc = l 

oo 


< 


< 


< 

The fact that 


^ E 


2 = 1 
bm. 


E 




< 'VAcmaxllc/Jill xmax|/3z| 


2 = 1 


iGUi 


m E max 1/3/1 

i=l 


Proof of Proposition 2.2. The fact that ; A G A/,^} is an orthonormal family - and thus an orthonor¬ 

mal basis of TO - is a classical fact of wavelet theory (see for instance [19]). Take to > 0 such that 

supp (V'o) U supp ((/>o) C [0, to] . 

For 1 > 0 and 1 < fc < 2^, we have 

Cfc < (N + 2) ||V',..|loo < (N + 2) 2^/2 ||^„||^ ^ 

oo 

where [to] is the integer part of to. We thus take Aj = 2-1 for j > 0 and A_i = 1, which gives 


Om. 

^ ^1 + \/ Dm, 


since Dm = 2''™+h By taking = max{(]TO]-I-2) ||V'o|loo > 1 + we thus get, for any j > — 1 and 

fee {i,...,2^}, 


if) 


per 

j,k 


— k'm \/^j and 'y ) \f~A^ ^ Vm\f~Dm • 


It remains to prove that there exists Ac > 0 such that, by denoting for p, G A/,^ and j G {—1,0,1,..., to}. 


= {a e A(j) ;supp(?/’/x)n®'^PP(^^) ^ ’ 

one has 

max Card (Aju ) < Ac (AjA“^ V l) . (27) 

fieA(i) 

Take jo = max {[log2 (to)] -1-1,0}. Then for all j > jo and k G {1,..., 2 ^~^°}, supp (V'j.fc) C [0,1). Furthermore, 
for every k G {l,... ,2^-^°} set T (k) = {2^-^°l + k;l G {0,...,23o _ i}}. Then {T{k)-kG {l,... ,2J-J«}} 
form a partition of {l,..., 2-1} and for fc, /c' S {l,..., 2l“A}, k ^ k', 


supp Pi supp {ipj,k') = 0- 


It is then easy to see that taking Ac = 2^° gives (27). ■ 


7.2 Proofs related to the slope heuristics 

We first notice that, from [37], Section 5, Theorems 3.1 and 3.2 are valid under the following general set of
assumptions (i.e. by replacing (SA) by (GSA) in the statement of the theorems):

General set of assumptions: (GSA)

Assume (P1), (P2), (P3), (Ab), (An) and (Ap_u) of (SA). Furthermore suppose that,
(Alb) there exists a constant $r_{\mathcal{M}}$ such that for each $m\in\mathcal{M}_n$ one can find an orthonormal basis $(\varphi_k)_{k=1}^{D_m}$
satisfying, for all $(\beta_k)_{k=1}^{D_m}\in\mathbb{R}^{D_m}$,

\[ \left\|\sum_{k=1}^{D_m}\beta_k\varphi_k\right\|_\infty \le r_{\mathcal{M}}\sqrt{D_m}\,|\beta|_\infty, \]

where $|\beta|_\infty = \max\{|\beta_k| \,;\, k\in\{1,\ldots,D_m\}\}$.

(Ac∞) a positive integer $n_1$ exists such that, for all $n \ge n_1$, there exist a positive constant $A_{\mathrm{cons}}$ and an event
$\Omega_\infty$ of probability at least $1 - n^{-2-\alpha_{\mathcal{M}}}$, on which for all $m\in\mathcal{M}_n$,

\[ \|\hat s_m - s_m\|_\infty \le A_{\mathrm{cons}}\sqrt{\frac{D_m\ln n}{n}}. \]

Now the proofs of Theorems 3.1 and 3.2 simply rely on the fact that assumptions (Alb) and (Ac∞) in
(GSA) are ensured under (SA). Indeed, assumption (Alb) in (GSA) is satisfied under assumption (Auslb)
in the set of assumptions (SA), see Proposition 2.1. Furthermore, Theorem 5.1 shows that assumption (Ac∞)
in (GSA) is also satisfied under assumption (Auslb).


7.3 Proofs related to V-fold procedures 

7.3.1 Proofs related to V-fold cross-validation 

Theorem 4.1 is a straightforward consequence of the following result, that will be proved below. Recall that the 
set of assumptions (GSA) is defined in Section 7.2 above. 


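Theorem 7.1 Assume that (GSA) holds. Let $r\in(2,+\infty)$ and $V\in\{2,\ldots,n-1\}$ satisfying $1 < V \le
r$. Define the VFCV procedure as the model selection procedure given by (12) and (13). Then, for all $n \ge
n_0((GSA),r)$, with probability at least $1 - L_{(GSA),r}\,n^{-2}$,

\[ \ell\big(s_*, \hat s^{(-1)}_{\hat m_{\mathrm{VFCV}}}\big) \le \left(1 + \frac{L_{(GSA),r}}{\sqrt{\ln n}}\right) \inf_{m\in\mathcal{M}_n}\big\{\ell\big(s_*, \hat s^{(-1)}_m\big)\big\} + L_{(GSA),r}\,\frac{(\ln n)^3}{n}. \]
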
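Proof of Theorem 7.1. All along the proof, the value of the constant $L_{(GSA),r}$ may vary from line to line.
We set

\[ \mathrm{crit}^0_{\mathrm{VFCV}}(m) = \mathrm{crit}_{\mathrm{VFCV}}(m) - \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}(\gamma(s_*)). \]

It is worth noting that the difference between $\mathrm{crit}^0_{\mathrm{VFCV}}(m)$ and $\mathrm{crit}_{\mathrm{VFCV}}(m)$ is a quantity independent of $m$,
when $m$ varies in $\mathcal{M}_n$. Hence, the procedure defined by $\mathrm{crit}^0_{\mathrm{VFCV}}$ gives the same result as the VFCV procedure
defined by $\mathrm{crit}_{\mathrm{VFCV}}$. It will be convenient for our analysis to consider $\mathrm{crit}^0_{\mathrm{VFCV}}$ instead of $\mathrm{crit}_{\mathrm{VFCV}}$.
We get for all $m\in\mathcal{M}_n$,

\[
\begin{aligned}
\mathrm{crit}^0_{\mathrm{VFCV}}(m) &= \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\big(\gamma(\hat s_m^{(-j)}) - \gamma(s_*)\big) \\
&= \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\big(\gamma(\hat s_m^{(-j)}) - \gamma(s_m)\big) + (P_n - P)(\gamma(s_m) - \gamma(s_*)) + P(\gamma(s_m) - \gamma(s_*)) \\
&= \ell\big(s_*, \hat s_m^{(-1)}\big) + \Delta_V(m) + \bar\delta(m),
\end{aligned}
\tag{28}
\]

where

\[ \Delta_V(m) = \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\big(\gamma(\hat s_m^{(-j)}) - \gamma(s_m)\big) - P\big(\gamma(\hat s_m^{(-1)}) - \gamma(s_m)\big), \]

and $\bar\delta(m)$ has been defined in Lemma 7.5. Furthermore denote

\[ p_1^{(-1)}(m) = P\big(\gamma(\hat s_m^{(-1)}) - \gamma(s_m)\big) \quad\text{and}\quad p_2^{(-1)}(m) = P_n^{(-1)}\big(\gamma(s_m) - \gamma(\hat s_m^{(-1)})\big). \]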


Let $\Omega_n$ be the event on which:
• For all models $m\in\mathcal{M}_n$ of dimension $D_m$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$, it holds


Pj (to) — E 

P2 ("^) 

< i(GSA),ren (w) E 

(-1) / \ 
P2 ("l) 

P2”^^ (to) -E 

P2”^^ (to) 

< L(GSA),r^n ® 

(~1) / \ 
P2 (^) 


together with 


|Ay(m)| < L(GSA).ren (w)E (m) 

|()(m)| < - J== -1- iv(GSA),r 


VDm 


VDm 




(m) 


• For all models $m\in\mathcal{M}_n$ of dimension $D_m$ such that $D_m \le A_{\mathcal{M},+}(\ln n)^3$, it holds


|Av (to)| 

< 

L(GSA),r 

5 (to) 

< 

L(GSA),r 

(- 1 ) ( ^ 
P 2 (™) 

< 

L(GSA),r 

Pi (^) 

< 

L{GSA),r 


(Inn)^ 


Dm. V In 


^(s*,Sm)lnn lnn\ 
n n j 

n . , (Inn)^ 


n 

Dm V Inn 


A i(GSA),r- 


n 


< A(GSA),r 


(Inn)^ 


(29) 


(30) 

(31) 


(32) 


(33) 


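By Theorem 2 of [36] and Lemma 4 of [37] applied with $\alpha = 2 + \alpha_{\mathcal{M}}$ and sample size $n_V = n(V-1)/V$,
Corollary 7.4 and Lemma 7.5 applied with $\alpha = 2 + \alpha_{\mathcal{M}}$, we get for all $n \ge n_0((GSA),r)$,

\[ \mathbb{P}(\Omega_n) \ge 1 - L_{(GSA),r}\sum_{m\in\mathcal{M}_n} n^{-2-\alpha_{\mathcal{M}}} \ge 1 - L_{(GSA),r}\,n^{-2}. \]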


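Control on the criterion $\mathrm{crit}^0_{\mathrm{VFCV}}$ for models of dimension not too small:

We consider models $m\in\mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$. Recall from (28) that

\[
\begin{aligned}
\mathrm{crit}^0_{\mathrm{VFCV}}(m) &= \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\big(\gamma(\hat s_m^{(-j)}) - \gamma(s_*)\big) \\
&= \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\big(\gamma(\hat s_m^{(-j)}) - \gamma(s_m)\big) + (P_n - P)(\gamma(s_m) - \gamma(s_*)) + P(\gamma(s_m) - \gamma(s_*)) \\
&= \ell\big(s_*, \hat s_m^{(-1)}\big) + \Delta_V(m) + \bar\delta(m).
\end{aligned}
\]

By (29), (30) and (31) we have on $\Omega_n$,

\[ \max\{|\Delta_V(m)|, |\bar\delta(m)|\} \le L_{(GSA),r}\,\varepsilon_n(m)\Big(\ell(s_*, s_m) + \mathbb{E}\big[p_2^{(-1)}(m)\big]\Big) \le L_{(GSA),r}\,\varepsilon_n(m)\,\ell\big(s_*, \hat s_m^{(-1)}\big). \]

Hence, identity (28) gives

\[ \Big|\mathrm{crit}^0_{\mathrm{VFCV}}(m) - \ell\big(s_*, \hat s_m^{(-1)}\big)\Big| \le L_{(GSA),r}\,\varepsilon_n(m)\,\ell\big(s_*, \hat s_m^{(-1)}\big). \tag{34} \]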


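Control on the criterion $\mathrm{crit}^{0}_{\mathrm{VFCV}}$ for models of small dimension:

We consider models $m \in \mathcal{M}_n$ such that $D_m \le A_{\mathcal{M},+}(\ln n)^3$. By (32), (59) and (33), it holds on $\Omega_n$, for any $\tau > 0$ and for all $m \in \mathcal{M}_n$ such that $D_m \le A_{\mathcal{M},+}(\ln n)^3$,

\[
\begin{aligned}
\Bigl|\mathrm{crit}^{0}_{\mathrm{VFCV}}(m) - \ell\bigl(s_*, \hat{s}_m^{(-1)}\bigr)\Bigr|
&\le L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n} + L_{(\mathrm{GSA}),r}\left(\sqrt{\frac{\ell(s_*, s_m)\ln n}{n}} + \frac{\ln n}{n}\right) \\
&\le L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n} + L_{(\mathrm{GSA}),r}\,\tau\,\ell(s_*, s_m) + \bigl(\tau^{-1} + 1\bigr)L_{(\mathrm{GSA}),r}\,\frac{\ln n}{n}.
\end{aligned}
\]

Hence, by taking $\tau = (\ln n)^{-2}$ in the last display and using $\ell(s_*, s_m) \le \ell(s_*, \hat{s}_m^{(-1)})$, we get

\[
\Bigl|\mathrm{crit}^{0}_{\mathrm{VFCV}}(m) - \ell\bigl(s_*, \hat{s}_m^{(-1)}\bigr)\Bigr|
\le L_{(\mathrm{GSA}),r}\left(\frac{\ell\bigl(s_*, \hat{s}_m^{(-1)}\bigr)}{(\ln n)^2} + \frac{(\ln n)^3}{n}\right). \tag{35}
\]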
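The introduction of the free parameter $\tau > 0$ in the bound for small models rests on an elementary inequality, spelled out here as a worked step. The symbols $a$ and $b$ are shorthand introduced for this illustration only, standing for $a = \ell(s_*, s_m)$ and $b = \ln n / n$.

```latex
\[
0 \le \Bigl(\sqrt{\tau a} - \sqrt{b/\tau}\Bigr)^2 = \tau a + \frac{b}{\tau} - 2\sqrt{ab}
\quad\Longrightarrow\quad
\sqrt{ab} \le \frac{1}{2}\Bigl(\tau a + \frac{b}{\tau}\Bigr) \le \tau a + \tau^{-1} b ,
\qquad a, b \ge 0,\ \tau > 0 .
\]
```

With these values of $a$ and $b$, this reads $\sqrt{\ell(s_*, s_m)\ln n / n} \le \tau\,\ell(s_*, s_m) + \tau^{-1}\ln n / n$, so a small $\tau$ trades a small multiple of the bias term against a larger logarithmic remainder.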


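Oracle inequalities:

We exploit the following inequality, that defines the selected model $\hat{m}$,

\[
\mathrm{crit}^{0}_{\mathrm{VFCV}}(\hat{m}) \le \inf_{m \in \mathcal{M}_n}\bigl\{\mathrm{crit}^{0}_{\mathrm{VFCV}}(m)\bigr\}. \tag{36}
\]

Indeed, using (34) and (35), we get that on $\Omega_n$ it holds,

\[
\begin{aligned}
\mathrm{crit}^{0}_{\mathrm{VFCV}}(\hat{m})
&\ge \left(1 - L_{(\mathrm{GSA}),r}\sup_{m \in \mathcal{M}_n,\ D_m \ge A_{\mathcal{M},+}(\ln n)^3}\varepsilon_n(m) - \frac{L_{(\mathrm{GSA}),r}}{(\ln n)^2}\right)\ell\bigl(s_*, \hat{s}_{\hat{m}}^{(-1)}\bigr) - L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n} \\
&\ge \left(1 - \frac{L_{(\mathrm{GSA}),r}}{\sqrt{\ln n}}\right)\ell\bigl(s_*, \hat{s}_{\hat{m}}^{(-1)}\bigr) - L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n}.
\end{aligned}
\tag{37}
\]

Furthermore, using again (34) and (35), we get

\[
\inf_{m \in \mathcal{M}_n}\bigl\{\mathrm{crit}^{0}_{\mathrm{VFCV}}(m)\bigr\}
\le \left(1 + \frac{L_{(\mathrm{GSA}),r}}{\sqrt{\ln n}}\right)\inf_{m \in \mathcal{M}_n}\Bigl\{\ell\bigl(s_*, \hat{s}_m^{(-1)}\bigr)\Bigr\} + L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n}. \tag{38}
\]

Putting (37) and (38) in (36), we get that for all $n \ge n_0((\mathrm{GSA}), r)$,

\[
\ell\bigl(s_*, \hat{s}_{\hat{m}}^{(-1)}\bigr)
\le \frac{1 + \dfrac{L_{(\mathrm{GSA}),r}}{\sqrt{\ln n}}}{1 - \dfrac{L_{(\mathrm{GSA}),r}}{\sqrt{\ln n}}}\,
\inf_{m \in \mathcal{M}_n}\Bigl\{\ell\bigl(s_*, \hat{s}_m^{(-1)}\bigr)\Bigr\} + L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n}.
\]

This concludes the proof of Theorem 7.1.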

7.3.2 Proofs related to V-fold penalization 

Recall that the set of assumptions (GSA) is defined in Section 7.2 above. The proof of Theorem 4.2 will be 
based on the following theorem, proved in [37] - see Theorem 2 and its proof under (GSA) therein. 

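Theorem 7.2 Suppose that the assumptions (GSA) of Section 3.2 hold, and furthermore suppose that for some $\delta \in [0,1)$ and $A_p, A_r > 0$, there exists an event of probability at least $1 - A_p n^{-2}$ on which, for every model $m \in \mathcal{M}_n$ such that $D_m \ge A_{\mathcal{M},+}(\ln n)^3$, it holds

\[
\bigl|\mathrm{pen}(m) - 2\,\mathbb{E}\bigl[P_n\bigl(\gamma(s_m) - \gamma(\hat{s}_m)\bigr)\bigr]\bigr|
\le \delta\Bigl(\ell(s_*, s_m) + \mathbb{E}\bigl[P_n\bigl(\gamma(s_m) - \gamma(\hat{s}_m)\bigr)\bigr]\Bigr)
\]

together with

\[
|\mathrm{pen}(m)| \le A_r\left(\frac{\ell(s_*, s_m)}{(\ln n)^2} + \frac{(\ln n)^3}{n}\right).
\]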
Then there exist an integer $n_0$ only depending on $\delta$ and $\beta_+$ and on constants in (GSA), a positive constant $L_0$ only depending on $c_{\mathcal{M}}$ given in (GSA) and on $A_p$, two positive constants $L_4$ and $L_5$ only depending on constants in (GSA) and on $A_r$ and a sequence


On < 


u 

(In 


such that it holds for all n > no, with probability at least 1 — L^n 


t (S:);, Sjn) ^ 



50n ] 
(1-SfJ 


inf (a "^m)} “t“ -^5 
me A1„ 


(Inn)^ 

n 


We now prove Theorem 4.2. 
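Proof of Theorem 4.2. We set

\[
\mathrm{pen}_0(m) = \mathrm{pen}_{\mathrm{VF}}(m) - \frac{V-1}{V}\sum_{j=1}^{V}\bigl(P_n - P_n^{(-j)}\bigr)\bigl(\gamma(s_*)\bigr).
\]

It is worth noting that the penalization procedure defined by $\mathrm{pen}_0$ gives the same result as the procedure defined by $\mathrm{pen}_{\mathrm{VF}}$. It will be convenient for our analysis to consider $\mathrm{pen}_0$ instead of $\mathrm{pen}_{\mathrm{VF}}$. Our strategy is to derive Theorem 4.2 as a corollary of Theorem 7.2 applied with $\mathrm{pen} = \mathrm{pen}_0$.

As $P_n = (1 - V^{-1})\,P_n^{(-j)} + V^{-1} P_n^{(j)}$ for every $j$, we get for all $m \in \mathcal{M}_n$,

\[
\begin{aligned}
\mathrm{pen}_0(m)
&= \frac{V-1}{V}\sum_{j=1}^{V}\bigl(P_n - P_n^{(-j)}\bigr)\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_*)\bigr) \\
&= \frac{V-1}{V^2}\sum_{j=1}^{V}\Bigl(P_n^{(j)}\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_*)\bigr) - P_n^{(-j)}\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_*)\bigr)\Bigr) \\
&= \frac{V-1}{V^2}\sum_{j=1}^{V}\Bigl(P_n^{(j)}\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_m)\bigr) - P_n^{(-j)}\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_m)\bigr)\Bigr) \\
&\quad + \frac{V-1}{V^2}\sum_{j=1}^{V}\Bigl(\bigl(P_n^{(j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr) - \bigl(P_n^{(-j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr)\Bigr) \\
&= \frac{V-1}{V}\Bigl(\bar{p}_1(m) + \bar{p}_2(m) + \delta(m) - \delta'(m)\Bigr),
\end{aligned}
\]

where

\[
\bar{p}_1(m) = \frac{1}{V}\sum_{j=1}^{V} P_n^{(j)}\bigl(\gamma(\hat{s}_m^{(-j)}) - \gamma(s_m)\bigr),
\qquad
\bar{p}_2(m) = \frac{1}{V}\sum_{j=1}^{V} P_n^{(-j)}\bigl(\gamma(s_m) - \gamma(\hat{s}_m^{(-j)})\bigr),
\]

and $\delta(m)$ and $\delta'(m)$ have been defined in Lemma 7.5. We also set

\[
p_1(m) = P\bigl(\gamma(\hat{s}_m) - \gamma(s_m)\bigr)
\quad\text{and}\quad
p_2(m) = P_n\bigl(\gamma(s_m) - \gamma(\hat{s}_m)\bigr).
\]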
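The computation of the recentred penalty relies on how the full empirical measure $P_n$ relates to the block measures $P_n^{(j)}$ and $P_n^{(-j)}$. The Python sketch below checks numerically, for blocks of equal size $n/V$ (an assumption of the sketch), that $P_n f - P_n^{(-j)} f = V^{-1}\bigl(P_n^{(j)} f - P_n^{(-j)} f\bigr)$ for every block $j$. The helper names and the test values are hypothetical and serve only as an illustration.

```python
def mean_over(values, indices):
    """Empirical mean of `values` restricted to `indices`."""
    indices = list(indices)
    return sum(values[i] for i in indices) / len(indices)

def max_identity_gap(values, V):
    """Largest violation, over the V blocks, of the identity
    P_n f - P_n^{(-j)} f = (P_n^{(j)} f - P_n^{(-j)} f) / V."""
    n = len(values)
    assert n % V == 0, "the identity is checked for blocks of equal size n / V"
    gap = 0.0
    for j in range(V):
        block = list(range(j, n, V))        # indices used by P_n^{(j)}
        in_block = set(block)
        rest = [i for i in range(n) if i not in in_block]  # used by P_n^{(-j)}
        lhs = mean_over(values, range(n)) - mean_over(values, rest)
        rhs = (mean_over(values, block) - mean_over(values, rest)) / V
        gap = max(gap, abs(lhs - rhs))
    return gap

# Arbitrary values standing for f(xi_1), ..., f(xi_n).
values = [((7 * i) % 5) ** 2 / 3.0 for i in range(12)]
print(max_identity_gap(values, V=4) < 1e-12)
```

The identity is what turns the factor $(V-1)/V$ in front of the penalty into the factor $(V-1)/V^2$ in front of the sums over blocks.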


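Let $\Omega_n$ be the event on which:

• For all models $m \in \mathcal{M}_n$ of dimension $D_m$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$, it holds

\[
|p_1(m) - \mathbb{E}[p_2(m)]| \le L_{(\mathrm{GSA})}\,\varepsilon_n(m)\,\mathbb{E}[p_2(m)],
\]
\[
|p_2(m) - \mathbb{E}[p_2(m)]| \le L_{(\mathrm{GSA})}\,\varepsilon_n^2(m)\,\mathbb{E}[p_2(m)],
\]

where $\varepsilon_n(m)$ is defined in Theorem 5.2, together with

\[
\Bigl|\bar{p}_1(m) - \frac{V}{V-1}\,\mathbb{E}[p_2(m)]\Bigr| \le L_{(\mathrm{GSA}),r}\,\varepsilon_n(m)\,\mathbb{E}[p_2(m)],
\]
\[
\Bigl|\bar{p}_2(m) - \frac{V}{V-1}\,\mathbb{E}[p_2(m)]\Bigr| \le L_{(\mathrm{GSA}),r}\,\varepsilon_n^2(m)\,\mathbb{E}[p_2(m)],
\]
\[
\max\{|\delta(m)|, |\delta'(m)|\} \le \frac{\ell(s_*, s_m)}{\sqrt{D_m}} + L_{(\mathrm{GSA}),r}\,\frac{\ln n}{\sqrt{D_m}}\,\mathbb{E}[p_2(m)]. \tag{39}
\]

• For all models $m \in \mathcal{M}_n$ of dimension $D_m$ such that $D_m \le A_{\mathcal{M},+}(\ln n)^3$, it holds

\[
\max\{|\delta(m)|, |\delta'(m)|\} \le L_{(\mathrm{GSA}),r}\left(\sqrt{\frac{\ell(s_*, s_m)\ln n}{n}} + \frac{\ln n}{n}\right), \tag{40}
\]
\[
\bar{p}_2(m) \le L_{(\mathrm{GSA}),r}\,\frac{D_m \vee \ln n}{n} \le L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n}, \tag{41}
\]
\[
\bar{p}_1(m) \le L_{(\mathrm{GSA}),r}\,\frac{D_m \vee \ln n}{n} \le L_{(\mathrm{GSA}),r}\,\frac{(\ln n)^3}{n}. \tag{42}
\]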


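By Theorem 2 of [36] and Lemma 4 of [37] applied with $\alpha = 2 + \alpha_{\mathcal{M}}$ and sample size $n_V = n(V-1)/V$, Corollary 7.4 and Lemma 7.5 applied with $\alpha = 2 + \alpha_{\mathcal{M}}$, we get for all $n \ge n_0((\mathrm{GSA}), r)$,

\[
\mathbb{P}(\Omega_n) \ge 1 - L_{(\mathrm{GSA}),r}\sum_{m \in \mathcal{M}_n} n^{-2-\alpha_{\mathcal{M}}} \ge 1 - L_{(\mathrm{GSA}),r}\, n^{-2}.
\]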


We consider models $m \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$. Notice that (39) implies by (18) that, for all $m \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$,


max{|(5(TO)| , |(5 '(to)|} < L(GSA).r 


1 \ 1/4 

(Inn)^ Inn 


Dn 


— X (^(s*,Sm) +E[P2 (w)]) 

-^TTJ. / 


^ -^(GSA),r^n (j^) (s * 5 )+E[p 2 (m)]). 

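We deduce that on $\Omega_n$ we have, for all models $m \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$ and for all $n \ge n_0((\mathrm{GSA}), r)$,

\[
\begin{aligned}
\bigl|\mathrm{pen}_0(m) - 2\,\mathbb{E}[p_2(m)]\bigr|
&\le \frac{V-1}{V}\left(\Bigl|\bar{p}_1(m) - \frac{V}{V-1}\,\mathbb{E}[p_2(m)]\Bigr| + \Bigl|\bar{p}_2(m) - \frac{V}{V-1}\,\mathbb{E}[p_2(m)]\Bigr|\right)
 + 2\max\{|\delta(m)|, |\delta'(m)|\} \\
&\le L_{(\mathrm{GSA}),r}\,\varepsilon_n(m)\bigl(\ell(s_*, s_m) + \mathbb{E}[p_2(m)]\bigr).
\end{aligned}
\tag{43}
\]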


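Let us now consider models $m \in \mathcal{M}_n$ such that $D_m \le A_{\mathcal{M},+}(\ln n)^3$. By (40), (41) and (42), we have on $\Omega_n$,

\[
\begin{aligned}
|\mathrm{pen}_0(m)|
&= \frac{V-1}{V}\,\bigl|\bar{p}_1(m) + \bar{p}_2(m) + \delta(m) - \delta'(m)\bigr| \\
&\le L_{(\mathrm{GSA}),r}\left(\sqrt{\frac{\ell(s_*, s_m)\ln n}{n}} + \frac{(\ln n)^3}{n}\right) \\
&\le L_{(\mathrm{GSA}),r}\left(\frac{\ell(s_*, s_m)}{(\ln n)^2} + \frac{(\ln n)^3}{n}\right).
\end{aligned}
\tag{44}
\]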


Inequality (44) implies that inequality (10) of Theorem 3.2 is satisfied with Ar = L(GSA),r- From (43) and (44), 
we thus apply Theorem 7.2 with Ap = Laj,,cm, and this gives Theorem 4.2 with 


dn — L(gsa), 


((Inn) sup < (to) ; (Inn)^ < 

V mGAin ^ 


- Dm < n 


ri+l/il+l3+)\ 




7.4 Proofs related to Section 5 

7.4.1 Proofs for strongly localized bases 
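Proof of Theorem 5.1. Let $C > 0$. Set

\[
\mathcal{F}_C^{\infty} := \{ s \in m \,;\ \|s - s_m\|_\infty \le C \}
\]

and

\[
\mathcal{F}_{>C}^{\infty} := \{ s \in m \,;\ \|s - s_m\|_\infty > C \} = m \setminus \mathcal{F}_C^{\infty}.
\]

Take an orthonormal basis $(\varphi_k)_{k=1}^{D_m}$ of $(m, \|\cdot\|_2)$ satisfying (Aslb). By Lemma 7.7, we get that there exists $L^{(1)}_{A,r_m,\alpha} > 0$ such that, by setting

\[
\Omega_1 = \left\{ \max_{k \in \{1,\ldots,D_m\}} \bigl|(P_n - P)(\psi_m \cdot \varphi_k)\bigr| \le L^{(1)}_{A,r_m,\alpha}\sqrt{\frac{\ln n}{n}} \right\},
\]

we have for all $n \ge n_0(A_+)$, $\mathbb{P}(\Omega_1) \ge 1 - n^{-\alpha}$. Moreover, we set

\[
\Omega_2 = \bigcap_{(k,l) \in \{1,\ldots,D_m\}^2}\left\{ \bigl|(P_n - P)(\varphi_k \cdot \varphi_l)\bigr| \le L^{(2)}_{\alpha,r_m}\min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\}\sqrt{\frac{\ln n}{n}} \right\},
\]

where $L^{(2)}_{\alpha,r_m}$ is defined in Lemma 7.6. By Lemma 7.6, we have that for all $n \ge n_0(A_+)$, $\mathbb{P}(\Omega_2) \ge 1 - n^{-\alpha}$ and so, for all $n \ge n_0(A_+)$,

\[
\mathbb{P}\bigl(\Omega_1 \cap \Omega_2\bigr) \ge 1 - 2n^{-\alpha}.
\]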

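We thus have for all $n \ge n_0(A_+)$,

\[
\begin{aligned}
\mathbb{P}\bigl(\|\hat{s}_m - s_m\|_\infty > C\bigr)
&\le \mathbb{P}\left(\inf_{s \in \mathcal{F}_{>C}^{\infty}} P_n\bigl(\gamma(s) - \gamma(s_m)\bigr) \le \inf_{s \in \mathcal{F}_{C}^{\infty}} P_n\bigl(\gamma(s) - \gamma(s_m)\bigr)\right) \\
&= \mathbb{P}\left(\sup_{s \in \mathcal{F}_{>C}^{\infty}} P_n\bigl(\gamma(s_m) - \gamma(s)\bigr) \ge \sup_{s \in \mathcal{F}_{C}^{\infty}} P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\right) \\
&\le \mathbb{P}\left(\left\{\sup_{s \in \mathcal{F}_{>C}^{\infty}} P_n\bigl(\gamma(s_m) - \gamma(s)\bigr) \ge \sup_{s \in \mathcal{F}_{C/2}^{\infty}} P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\right\} \cap \Omega_1 \cap \Omega_2\right) + 2n^{-\alpha}.
\end{aligned}
\tag{45}
\]

Now, for any $s \in m$ such that

\[
s - s_m = \sum_{k=1}^{D_m}\beta_k\varphi_k, \qquad \beta = (\beta_k)_{k=1}^{D_m} \in \mathbb{R}^{D_m},
\]

we have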


Pn (l(Sm)-l{s)) 

= {Pn - P) (V'm • (Sm “ s)) “ {Pn “ P) ((s “ Sm)^) - P (7 (s) - 7 {Sm)) 

Dm Dm Dm 

= '^Pk {Pn - P) (V'm -Vk)- ^ PkPl {Pn “ P) {<fk ' ‘Pi) - PI- 

k,l^l 




k^l 


We set for any $(k,l) \in \{1,\ldots,D_m\}^2$,

\[
R^{(1)}_{n,k} = (P_n - P)(\psi_m \cdot \varphi_k)
\quad\text{and}\quad
R^{(2)}_{n,k,l} = (P_n - P)(\varphi_k \cdot \varphi_l).
\]


Moreover, we set a function hn, defined as follows, 

Dm 


Dr, 


Dr, 


hn-.p = ^ - E 


k=l 


k,l=l 


We thus have for any s G m such that s — Sm = PkPki P = {Pk)k=i € 

Pn (7 (Sm) - 7 (s)) = hn {P) ■ 

In addition we set for any /? = (/3fe)Ei ^ 


k=l 
Dm ^ tuD™ 


( 46 ) 


I/3L.OO =rm^\/A,m&^\j3k\ 

Z=1 
It is straightforward to see that $|\cdot|_{m,\infty}$ is a norm on $\mathbb{R}^{D_m}$. We also set for a real $D_m \times D_m$ matrix $B$, its operator norm $\|B\|_m$ associated to the norm $|\cdot|_{m,\infty}$ on the $D_m$-dimensional vectors. More explicitly, we set for any $B \in \mathbb{R}^{D_m \times D_m}$,

\[
\|B\|_m := \sup_{\beta \in \mathbb{R}^{D_m},\ \beta \ne 0}\frac{|B\beta|_{m,\infty}}{|\beta|_{m,\infty}}.
\]

We have, for any B = {Bk,i)k,i=i^...Drr, ^ 


sup 

% I^L,oo-i 


sup 

/3gIR^-, \I3\^ 


• X! X! Bk,iPi 


^^yi^max ^ ^ Bk,il3i 

i—1 ^ j—1 iGUj 


/36K"™, l/5L.oo = l 


= y \/Ai max < max 
^ fcen. je{i... 


£ ^Pm 1 pax|/3;| Ja^ \Bk,l 




Notice that by Inequality (5) of (Aslb), it holds 


Tyc cisem; s- Sm = Y ^ 


^C/2 oisem; s-Sm = Y ^ <C/2 \ . 


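Hence, from (45), (46), (47) and (48) we deduce that if we find on $\Omega_1 \cap \Omega_2$ a value of $C$ such that

\[
\sup_{\beta \in \mathbb{R}^{D_m},\ |\beta|_{m,\infty} > C} h_n(\beta) < \sup_{\beta \in \mathbb{R}^{D_m},\ |\beta|_{m,\infty} \le C/2} h_n(\beta),
\]

then Inequality (17) follows and Theorem 5.1 is proved. Taking the partial derivatives of $h_n$ with respect to the coordinates of its arguments, it then holds for any $(k,l) \in \{1,\ldots,D_m\}^2$ and $\beta = (\beta_i)_{i=1}^{D_m} \in \mathbb{R}^{D_m}$,

\[
\frac{\partial h_n}{\partial \beta_k}(\beta) = R^{(1)}_{n,k} - 2\sum_{l=1}^{D_m}\beta_l R^{(2)}_{n,k,l} - 2\beta_k. \tag{49}
\]

We look now at the set of solutions $\beta$ of the following system,

\[
\frac{\partial h_n}{\partial \beta_k}(\beta) = 0, \qquad \forall k \in \{1,\ldots,D_m\}. \tag{50}
\]

We define the $D_m \times D_m$ matrix $R_n^{(2)}$ to be

\[
R_n^{(2)} := \bigl(R^{(2)}_{n,k,l}\bigr)_{k,l=1,\ldots,D_m}
\]

and by (49), the system given in (50) can be written

\[
2\bigl(I_{D_m} + R_n^{(2)}\bigr)\beta = R_n^{(1)}, \tag{S}
\]

where $R_n^{(1)}$ is a $D_m$-dimensional vector defined by

\[
R_n^{(1)} = \bigl(R^{(1)}_{n,k}\bigr)_{k=1,\ldots,D_m}.
\]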
Let us give an upper bound of the norm $\|R_n^{(2)}\|_m$, in order to show that the matrix $I_{D_m} + R_n^{(2)}$ is nonsingular. On $\Omega_2$ we have, using (6),




> V Ai max < max 
^ fcen. I ie{i,...,&„} 


> V Ai max < max 
^ fcen. I ie{i.. 


< 


A- E 

A" E |«s,. 

Vv/^maxJ max J OA"^Card (n^|fc) max|(P„ - P) ((^a, • (^i)| U 

^ kGUi J J 


< AcL^l \j ^ max ^ i; , i , 
”"V n 1 V Aj 


4^ f 4^ V 1 1 Omin{A,;Aj} 


2=1 ■ 


We deduce from (4) and (51) that on 




< La. 


.a.Vm 


Af, Inn 


(51) 


(52) 


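Hence, from (52) and the fact that $D_m \le A_+ \dfrac{n}{(\ln n)^2}$, we get that for all $n \ge n_0(A_+, A_c, r_m, \alpha)$, it holds on $\Omega_2$,

\[
\bigl\|R_n^{(2)}\bigr\|_m \le \frac{1}{2}
\]

and the matrix $\bigl(I_{D_m} + R_n^{(2)}\bigr)$ is nonsingular, of inverse $\bigl(I_{D_m} + R_n^{(2)}\bigr)^{-1} = \sum_{u=0}^{+\infty}\bigl(-R_n^{(2)}\bigr)^u$. Hence, the system (S) admits a unique solution $\beta^{(n)}$, given by

\[
\beta^{(n)} = \frac{1}{2}\bigl(I_{D_m} + R_n^{(2)}\bigr)^{-1}R_n^{(1)}.
\]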

Now, on Oi we have by (4), 


<r^(y] '/a ) , max l(Pn - R) ■ fk)! < rmLA^ r c 

m,oo \ ^ 


Pm Inn 


and we deduce that for all no (A_|_, Ac, rm, o;), it holds on O 2 P Oi, 


/ 3 O 


1 

< - 
m.oo 2 




-1 




(1) 




Pm. In 5 


(53) 


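Moreover, by the formula (46) we have

\[
h_n(\beta) = P_n\bigl(\gamma(s_m)\bigr) - P_n\Bigl(\gamma\Bigl(s_m + \sum_{k=1}^{D_m}\beta_k\varphi_k\Bigr)\Bigr)
\]

and we thus see that $h_n$ is concave. Hence, for all $n \ge n_0(A_+, A_c, r_m, \alpha)$, we get that on $\Omega_2$, $\beta^{(n)}$ is the unique maximum of $h_n$ and on $\Omega_2 \cap \Omega_1$, by (53), concavity of $h_n$ and uniqueness of $\beta^{(n)}$, we get

\[
h_n\bigl(\beta^{(n)}\bigr) = \sup_{\beta \in \mathbb{R}^{D_m},\ |\beta|_{m,\infty} \le C/2} h_n(\beta) > \sup_{\beta \in \mathbb{R}^{D_m},\ |\beta|_{m,\infty} > C} h_n(\beta),
\]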

with C = 2rmP^c™ which concludes the proof. ■ 
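The algebraic core of the proof is that the concave quadratic $h_n$ is maximized at the unique solution of the linear system $2(I_{D_m} + R_n^{(2)})\beta = R_n^{(1)}$. The following Python sketch illustrates this on a small example. The arrays `r1` and `r2` are hypothetical stand-ins for $R_n^{(1)}$ and $R_n^{(2)}$ (the latter assumed symmetric and small, so that $I + R_n^{(2)}$ is positive definite), and the sign conventions are those of display (49); none of the numerical values comes from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def h(beta, r1, r2):
    """h(beta) = sum_k beta_k r1_k - sum_{k,l} beta_k beta_l r2_kl - sum_k beta_k^2."""
    d = len(beta)
    lin = sum(beta[k] * r1[k] for k in range(d))
    quad = sum(beta[k] * beta[l] * r2[k][l] for k in range(d) for l in range(d))
    return lin - quad - sum(b * b for b in beta)

def gradient(beta, r1, r2):
    """Partial derivatives of h, valid for a symmetric matrix r2."""
    d = len(beta)
    return [r1[k] - 2.0 * sum(r2[k][l] * beta[l] for l in range(d)) - 2.0 * beta[k]
            for k in range(d)]

def maximizer(r1, r2):
    """Solve the first-order condition 2 (I + r2) beta = r1."""
    d = len(r1)
    a = [[2.0 * ((1.0 if k == l else 0.0) + r2[k][l]) for l in range(d)]
         for k in range(d)]
    return solve_linear(a, r1)

r1 = [0.3, -0.2, 0.1]
r2 = [[0.10, 0.05, 0.00],
      [0.05, -0.10, 0.02],
      [0.00, 0.02, 0.05]]
beta_star = maximizer(r1, r2)
# The gradient vanishes at beta_star, up to rounding.
print(max(abs(g) for g in gradient(beta_star, r1, r2)) < 1e-12)
```

Since the maximizer is small whenever `r1` is small, any $\beta$ far from the origin gives a strictly smaller value of the quadratic, which is the mechanism behind the comparison of the two suprema.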

Remark 7.1 The proof of Theorem 5.1 can be adapted for models endowed with a localized basis structure. Indeed, if we set for any $B \in \mathbb{R}^{D_m \times D_m}$,


i^llm := sup 


\BPl 


sup 


|S/3|c 


/3gK"™./35^0 ip I m,oo /3eK"™./35^0 IP I 00 
then we have the following classical formula


Now, it holds. 


ISIL = 


max 





= max 


\iPn - P) i^k ■ ^l)\ 


’ ™ fee{i....,D„} 


min{||v3fc||^;||v?,||^}y'^ 




< r 

_ ' m^oc Trr 


Df, Inn 


n 


The previous bound tends to zero if $D_m \le n^{1/3}/\ln^2(n)$ and this is the essential reason why results for localized bases are restricted to models with dimension lower than $n^{1/3}/\ln^2(n)$ while for strongly localized bases we can go as far as $D_m \le n/\ln^2(n)$ (see also Remark 5.1).


7.4.2 Proofs related to excess risks’ representations 
Proof of Proposition 5.3. Let us write C* := T (sm)- It holds 

inf P„ (7 (s)) = inf P„ (7 (s)) 

sGdc,t, sGm 

<min| inf P„(7(s))|, 

which readily proves Formula (21). Formula (22) is a direct consequence of (21), since me = Ur<c '^C'- ■ 
Proof of Proposition 5.5. We will only prove the case where $R_0 = +\infty$. Then the situation where $R_0 \in \mathbb{R}_+$ can be deduced easily by noticing that the subset $\{s \in m \,;\ R(s) \le R_0\}$ of $m$ actually plays the role of $m$ in this latter case.

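When $R_0 = +\infty$, we have with the notations of Proposition 5.3 and by taking $R = P\bigl(\gamma(\cdot) - \gamma(s_m)\bigr)$, $\widetilde{\mathcal{D}}_C = \mathcal{D}_C$ and $\widetilde{m}_C = m_C$. From formula (21) we thus get

\[
\begin{aligned}
P\bigl(\gamma(\hat{s}_m) - \gamma(s_m)\bigr)
&\in \arg\min_{C \ge 0}\left\{\inf_{s \in \mathcal{D}_C} P_n\bigl(\gamma(s)\bigr)\right\} \\
&= \arg\max_{C \ge 0}\left\{\sup_{s \in \mathcal{D}_C} P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\right\} \\
&= \arg\max_{C \ge 0}\left\{\sup_{s \in \mathcal{D}_C} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C\right\}.
\end{aligned}
\]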

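Hence, Formula (23) is proved. Now, for (24), take any $C \ge 0$ and notice that there exists a random variable $C_1 \in [0, C]$ such that

\[
\begin{aligned}
\sup_{s \in m_C}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C
&= \sup_{s \in \mathcal{D}_{C_1}} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C \\
&\le \sup_{s \in \mathcal{D}_{C_1}} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C_1 \\
&\le \sup_{s \in \mathcal{D}_{C_*}} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C_*,
\end{aligned}
\tag{54}
\]

where $C_* := P\bigl(\gamma(\hat{s}_m) - \gamma(s_m)\bigr)$. Taking $C = C_*$, we get

\[
\sup_{s \in m_{C_*}}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C_* \le \sup_{s \in \mathcal{D}_{C_*}} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C_*
\]

and since $\mathcal{D}_{C_*} \subset m_{C_*}$, this implies

\[
\sup_{s \in m_{C_*}}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C_* = \sup_{s \in \mathcal{D}_{C_*}} (P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr) - C_*.
\]

Together with (54), the latter equality gives that for any $C \ge 0$,

\[
\sup_{s \in m_C}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C \le \sup_{s \in m_{C_*}}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C_*,
\]

which is another way to write (24).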

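Now, considering the case of the empirical excess risk, we could again apply Proposition 5.3, but we will follow a more direct proof. We have, by definition of $\hat{s}_m$,

\[
\ell_{\mathrm{emp}}(\hat{s}_m, s_m) = P_n\bigl(\gamma(s_m) - \gamma(\hat{s}_m)\bigr) = \max_{s \in m}\bigl\{P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\}.
\]

Now, as $\{s \in m \,;\ R(s) \le R_0\} = \bigcup_{C \ge 0}\widetilde{\mathcal{D}}_C = \bigcup_{C \ge 0}\mathcal{D}_C$, we get

\[
\begin{aligned}
\ell_{\mathrm{emp}}(\hat{s}_m, s_m)
&= \max_{s \in m}\bigl\{P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} \\
&= \max_{C \ge 0}\ \sup_{s \in \mathcal{D}_C}\bigl\{P_n\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} \\
&= \max_{C \ge 0}\left\{\sup_{s \in \mathcal{D}_C}\bigl\{(P_n - P)\bigl(\gamma(s_m) - \gamma(s)\bigr)\bigr\} - C\right\},
\end{aligned}
\]

that is (25). Now formula (26) follows from the kind of arguments that allow to prove (24) based on (25).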
Appendix

7.5 Some lemmas instrumental in the proofs 

We gather here the lemmas that are used in the proofs of Section 7. 

In the next lemma, we apply Lemma 9 of [37], with n 2 = n/V, ni = n — n 2 = n (1 — 1/V), r = 2 and we 
set r = c~^ S (l,-|-oo). Furthermore, the notations (m) used in [37] correspond respectively to the 

quantities Pn^ and s^^'^. 


Lemma 7.3 Assume that (GSA) holds. Let r G (2, -|-oo) and V G {2,..., n — 1} satisfying 1 < V < r. Then 
there exists L = L(^qsa) j. > 0 such that for all m G Ain satisfying Dm > ^Ar,+ (Inn)^, by setting 


In? 


it holds for all n > uq {{GSA) ,r) and for all j G {1,..., V}, 

P ( Pn'' (7 - 7 (Sm)) - (7 “ 7 (Sm)) > 4^^ (w) ^ P^^^ (w) ) < 12n"^"“-^ , 

where P 2 {i^) = Pn ^7 (sm) — 7 ^ If Dm < A_\4,+{^rin)^, then for all n > no {{GSA) ,r), 


Pn'’ (7 (4n - 7 (Sm)) “ ^ (7 (4^ “ 7 (^m)) 


> L 


(Inn) 


< 12n"2"“^. 


Taking into account the averaging between the blocks of the F-fold, we get from Lemma 7.3 the following 
corollary. 

Corollary 7.4 Assume that (GSA) holds. Let r G (0,1) and V G {2, ...,n—1} satisfying 1 < V < r. 
Then there exists L — L(^qsa) j. > 0 such that for all m G Ain satisfying Dm > ^Ai,+ (Inn)^, it holds for all 
n>no {{GSA) ,r), 


P 


(7 (4n - 7 (Sm)) - ^ iZ Pn^ {'I (4^ “ 7 (Sm)) 


i=i 


> Len{m)E P 2 {m) 


< 22rn 


— 2 —ayvi 


(55) 

where P 2 (to) = Pn (7 (sm) — 7 ( 4 n and Sn (to) is defined in Theorem 5.2. If Dm < ^Ai,+ (Inn)^, 
then for all n > uq {{GSA) ,r), 


P\ 


(7 (sL - 7 (Sm)) - ^ ^ Pn'’ (7 (4i - 7 (Sm)) 


i=i 


> L 


(luTr)^ 


<22rn-2-“^. (56) 
Proof. First we prove the following inequality,

p(|^EJ=i-P (7 -7(sm)) - (7 - 7(sm))| > {'m)E (m)j^ < 

(57) 

Indeed, it easily derives from Lemma 7.3 together with a union bound along the V blocks, taking advantage of 
the following formula 


< max 


^ P (7 - 7 (s„)) “ ^ ^ ^ 

i=i t=i 

Pn^ (7 - 7 (Sm)) - 7^ (7 - 7 (Sm) 


Then, we show that the quantity y E]Li P ^7 — 7 (sm)^ is close enough to P ^7 — 7 (sm)^ 

with probability close to one. Indeed, it holds for any C > 0, 


(7 (^m - 7 (Sm)) - y'^P {'y -1 (^m) 

i=i 

("7 (^^0 “ - T’ (7 - 7 (Sm)) 


> c 


< 


i=2 


max 


> C 


P 


(7 (sL - 7 (Sm)) - P (7 (4^ - 7 (Sm) 


V' 

< -7(sm)) -p( 7(^E^^) -7(sm)) >c) 


> c 


i=2 


i=2 


^p( P (7 (^m - 7 (Sm)) 


n 


P 


(7 -7(Sm) 


> C 


< 


2V'p(|p(7(7r‘>)-7M)-%|>2). 


Hence, from Theorem 2 of [36] applied with $\alpha = 2 + \alpha_{\mathcal{M}}$ and sample size equal to $n_V = n(V-1)/V$, we get that by taking

C = 2£nv {m) ^ < P(GSA),ren (w) E p^"^^ (m) , 


it holds 


7 (^m - 7 (Sm)) - ^ XI (7 (4^ - 7 (Sm)) 


i=i 


> C < 


(58) 


Inequality (55) now follows from combining (57) with (58) and noticing that $\varepsilon_{n_V}(m) \le L_{(\mathrm{GSA}),r}\,\varepsilon_n(m)$. Inequality (56) also derives from Lemma 7.3 with the same type of reasoning and further details are left to the reader. ■

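Lemma 7.5 Let $\alpha > 0$. Assume that (GSA) is satisfied and that $1 < V \le r$. Then by setting

\[
\delta(m) = \frac{1}{V}\sum_{j=1}^{V}\bigl(P_n^{(j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr)
\quad\text{and}\quad
\delta'(m) = \frac{1}{V}\sum_{j=1}^{V}\bigl(P_n^{(-j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr),
\]

we have for all $m \in \mathcal{M}_n$,

\[
\mathbb{P}\left(\max\{|\delta(m)|, |\delta'(m)|\} \ge L_{(\mathrm{GSA}),r}\left(\sqrt{\frac{\ell(s_*, s_m)\ln n}{n}} + \frac{\ln n}{n}\right)\right) \le 2r\,n^{-\alpha}. \tag{59}
\]

Furthermore, for all $m \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^3 \le D_m$ and for all $n \ge n_0((\mathrm{GSA}), \alpha)$, we have

\[
\mathbb{P}\left(\max\{|\delta(m)|, |\delta'(m)|\} \ge \frac{\ell(s_*, s_m)}{\sqrt{D_m}} + L_{(\mathrm{GSA}),r}\,\frac{\ln n}{\sqrt{D_m}}\,\mathbb{E}\bigl[p_2^{(-1)}(m)\bigr]\right) \le 2r\,n^{-\alpha},
\]

where $p_2^{(-1)}(m) := P_n^{(-1)}\bigl(\gamma(s_m) - \gamma(\hat{s}_m^{(-1)})\bigr) \ge 0$.

Proof. Notice that for any $C \ge 0$,

\[
\begin{aligned}
\mathbb{P}\bigl(|\delta(m)| \ge C\bigr)
&\le \mathbb{P}\left(\max_{j \in \{1,\ldots,V\}}\bigl|\bigl(P_n^{(j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr)\bigr| \ge C\right) \\
&\le \sum_{j=1}^{V}\mathbb{P}\Bigl(\bigl|\bigl(P_n^{(j)} - P\bigr)\bigl(\gamma(s_m) - \gamma(s_*)\bigr)\bigr| \ge C\Bigr).
\end{aligned}
\]

Then use $V$ times Lemma 5 of [37] with a sample size equal to $n/V$ in order to control the summands at the right-hand side of the inequality in the last display. The same reasoning holds for $|\delta'(m)|$. Further details are left to the reader. ■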

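Lemma 7.6 Let $\alpha > 0$. Consider a finite-dimensional linear model $m$ of linear dimension $D_m$ and assume that $(\varphi_k)_{k=1}^{D_m}$ is a localized orthonormal basis of $(m, \|\cdot\|_2)$ with index of localization $r_m > 0$. More explicitly, we thus assume that for all $\beta = (\beta_k)_{k=1}^{D_m} \in \mathbb{R}^{D_m}$,

\[
\Bigl\|\sum_{k=1}^{D_m}\beta_k\varphi_k\Bigr\|_\infty \le r_m\sqrt{D_m}\,|\beta|_\infty.
\]

If (Ab(m)) given in Theorem 5.1 holds and if for some positive constant $A_+$,

\[
D_m \le A_+\frac{n}{(\ln n)^2},
\]

then there exists a positive constant $L^{(2)}_{\alpha,r_m}$ such that for all $n \ge n_0(A_+)$, we have

\[
\mathbb{P}\left(\exists (k,l) \in \{1,\ldots,D_m\}^2 :\ \bigl|(P_n - P)(\varphi_k \cdot \varphi_l)\bigr| \ge L^{(2)}_{\alpha,r_m}\min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\}\sqrt{\frac{\ln n}{n}}\right) \le n^{-\alpha}. \tag{60}
\]

Proof. For any $(k,l) \in \{1,\ldots,D_m\}^2$, we have

\[
\mathbb{E}\bigl[(\varphi_k \cdot \varphi_l)^2\bigr] \le \min\bigl\{\|\varphi_k\|_\infty^2 \,;\ \|\varphi_l\|_\infty^2\bigr\}
\]

and

\[
\begin{aligned}
\|\varphi_k \cdot \varphi_l\|_\infty
&\le \min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\} \times \max\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\} \\
&\le \min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\} \times r_m\sqrt{D_m}.
\end{aligned}
\]

Hence, we apply Bernstein's inequality (see Proposition 2.9 in [32]) and we get, for all $\gamma > 0$,

\[
\mathbb{P}\left(\bigl|(P_n - P)(\varphi_k \cdot \varphi_l)\bigr| \ge \min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\}\left(\sqrt{\frac{2\gamma\ln n}{n}} + \frac{r_m\sqrt{D_m}\,\gamma\ln n}{3n}\right)\right) \le 2n^{-\gamma}. \tag{61}
\]

Since, for all $n \ge n_0(A_+)$,

\[
\frac{\sqrt{D_m}\,\ln n}{n} \le \sqrt{\frac{A_+}{n}} \le \sqrt{\frac{\ln n}{n}},
\]

we get from (61) that for all $n \ge n_0(A_+)$,

\[
\begin{aligned}
&\mathbb{P}\left(\exists (k,l) \in \{1,\ldots,D_m\}^2 :\ \bigl|(P_n - P)(\varphi_k \cdot \varphi_l)\bigr| \ge \Bigl(\sqrt{2\gamma} + \frac{r_m\gamma}{3}\Bigr)\min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\}\sqrt{\frac{\ln n}{n}}\right) \\
&\quad\le \sum_{(k,l) \in \{1,\ldots,D_m\}^2}\mathbb{P}\left(\bigl|(P_n - P)(\varphi_k \cdot \varphi_l)\bigr| \ge \min\{\|\varphi_k\|_\infty \,;\ \|\varphi_l\|_\infty\}\left(\sqrt{\frac{2\gamma\ln n}{n}} + \frac{r_m\sqrt{D_m}\,\gamma\ln n}{3n}\right)\right) \\
&\quad\le 2D_m^2\,n^{-\gamma} \le n^{-\gamma+2}.
\end{aligned}
\tag{62}
\]

We deduce from (62), taken with $\gamma = \alpha + 2$, that (60) holds with $L^{(2)}_{\alpha,r_m} = \sqrt{2\alpha + 4} + (\alpha + 2)\,r_m/3 > 0$. ■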
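To make the use of Bernstein's inequality in this proof concrete, here is a small Python sketch of the deviation level it produces. It assumes the standard form of the inequality for an empirical mean of $n$ independent terms bounded in absolute value by `sup_bound` with second moment at most `second_moment`: the level $\sqrt{2vx/n} + bx/(3n)$ is exceeded in absolute value with probability at most $2e^{-x}$. The function names and the numerical values are illustrative only.

```python
import math

def bernstein_threshold(second_moment, sup_bound, n, x):
    """Level t with P(|(P_n - P) f| >= t) <= 2 exp(-x), under the assumptions
    E[f^2] <= second_moment and |f| <= sup_bound (standard Bernstein form)."""
    return math.sqrt(2.0 * second_moment * x / n) + sup_bound * x / (3.0 * n)

def product_threshold(min_sup, r_m, dim, n, gamma):
    """Level for f = phi_k * phi_l with x = gamma * ln(n), using
    E[f^2] <= min_sup**2 and |f| <= min_sup * r_m * sqrt(dim)."""
    x = gamma * math.log(n)
    return bernstein_threshold(min_sup ** 2, min_sup * r_m * math.sqrt(dim), n, x)

def tail_bound(n, gamma):
    """Probability bound 2 exp(-gamma ln n) = 2 n^(-gamma) attached to that level."""
    return 2.0 * math.exp(-gamma * math.log(n))

level = product_threshold(min_sup=2.0, r_m=1.5, dim=16, n=1000, gamma=3.0)
print(level, tail_bound(1000, 3.0))
```

The first term of the level scales like $\sqrt{\ln n / n}$ and the second like $\sqrt{D_m}\,\ln n / n$, which is why a dimension of order at most $n/(\ln n)^2$ lets the second term be absorbed into the first.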
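Lemma 7.7 Under the assumptions of Lemma 7.6 there exists a positive constant $L^{(1)}_{A,r_m,\alpha}$ such that for all $n \ge n_0(A_+)$, we have

\[
\mathbb{P}\left(\max_{k \in \{1,\ldots,D_m\}}\bigl|(P_n - P)(\psi_m \cdot \varphi_k)\bigr| \ge L^{(1)}_{A,r_m,\alpha}\sqrt{\frac{\ln n}{n}}\right) \le n^{-\alpha},
\]

where $\psi_m(x,y) = -2\,(y - s_m(x))$.

Proof. Let $\beta > 0$. Notice that by (Ab(m)),

\[
|\psi_m(X,Y)| \le 4A \quad \text{a.s.}
\]

Then by Bernstein's inequality, we get by straightforward computations (in the spirit of the proof of Lemma 7.6) that there exists $L^{(1)}_{A,r_m,\beta} > 0$ such that, for all $k \in \{1,\ldots,D_m\}$,

\[
\mathbb{P}\left(\bigl|(P_n - P)(\psi_m \cdot \varphi_k)\bigr| \ge L^{(1)}_{A,r_m,\beta}\sqrt{\frac{\ln n}{n}}\right) \le n^{-\beta}.
\]

Now the result follows from a simple union bound with $\beta = \alpha + 1$. ■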


7.6 Additional simulation results 

This section provides simulation results that complement those of Section 6. Figure 5 is the analogue of Figure 4 and illustrates the difference between the test functions Spikes and Wave for a smaller sample size (i.e. $n = 1024$).
Figure 5: (a)-(d): Noisy versions of Wave for each $\sigma(\cdot)$ scenario. (e): Typical reconstructions from a single simulation with $n = 1024$. The dotted line is the true signal and the solid one depicts the estimate $\hat{s}_{\hat{m}_{\mathrm{SH}}}$. (f): Graph of the excess risk against the dimension $D_m$ and (shifted) $\mathrm{crit}_{\mathrm{SH}}(m)$ (in a log-log scale). The gray circle represents the global minimizer $\hat{m}$ of $\mathrm{crit}_{\mathrm{SH}}(m)$ and the black star the oracle model $m_*$. (g): Noisy and selected (black) wavelet coefficients (see Figure 3(e) for a visual comparison with the original wavelet coefficients).

































